Preface


This treatise brings together results on aspects of statistical information, notably 
concerning likelihood functions, plausibility functions, ancillarity, and 
sufficiency, and on exponential families of probability distributions. A brief 
outline of the contents and structure of the book is given in the beginning of the 
introductory chapter. 

Much of the material presented is of fairly recent origin, and some of it is new. 
The book constitutes a further development of my Sc.D. thesis from the 
University of Copenhagen (Barndorff-Nielsen 1973a) and includes results from a 
number of my later papers as well as from papers by many other authors. 
References to the literature are given partly in the text proper, partly in the Notes 
sections at the ends of Chapters 2-4 and 8-10. 

The roots of the book lie in the writings of R. A. Fisher both as concerns results 
and the general stance to statistical inference, and this stance has been a 
determining factor in the selection of topics. 

Figures 2.1 and 10.1 are reproduced from Barndorff-Nielsen (1976a and b) by 
permission of the Royal Statistical Society, and Figures 10.2 and 10.3 are 
reproduced from Barndorff-Nielsen (1973c) by permission of the Biometrika 
Trustees. The results from R. T. Rockafellar's book Convex Analysis (copyright 
© 1970 by Princeton University Press) quoted in Chapter 5 are reproduced by 
permission of Princeton University Press. 

In the work I have benefited greatly from discussions with colleagues and 
students. Adding to the acknowledgements in my Sc.D. thesis, I wish here 
particularly to express my warm gratitude to Preben Blaesild, David R. Cox, 
Jorgen G. Pedersen, Helge Gydesen, Geert Schou, and especially Anders H. 
Andersen for critical readings of the manuscript, to David G. Kendall for helpful 
and stimulating comments, and to Anne Reinert for unfailingly excellent and 
patient secretarial assistance. A substantial part of the manuscript was prepared 
in the period August 1974-January 1975 which I spent in Cambridge, at 
Churchill College and the Statistical Laboratory of the University. I am most 
grateful to my colleagues at the Department of Theoretical Statistics, Aarhus 
University, and the Statistical Laboratory, Cambridge University, and to the 
Fellows of Churchill for making this stay possible. 

Aarhus, May 1977 O. B. -N. 
CHAPTER 1 


Introduction 

1.1 INTRODUCTORY REMARKS AND OUTLINE 

The main kinds of task in statistics are the construction or choice of a statistical 
model for a given set of data, and the assessment and charting of statistical 
information in model and data. 

This book is concerned with certain questions of statistical information 
thought to be of interest for purposes of scientific inference. It also contains an 
account of the theory of exponential families of probability measures, with 
particular reference to those questions. Besides exponential families, the most 
important type of statistical models are the group families, i.e. families of 
probability measures generated by a unitary group of transformations on the 
sample space. However, only the most basic facts on group families will be 
referred to. (Some further introductory remarks on these two types of models are 
given in Section 1.3.) Another limitation is that asymptotic problems are not 
discussed, except for a few remarks. 

The reader is supposed to have a fairly broad, basic knowledge of statistical 
inference, and in particular to be familiar with the more conceptual aspects of 
likelihood and plausibility, such as are discussed in Birnbaum (1969) and 
Barndorff-Nielsen (1976b), respectively. 

Probability functions, likelihood functions, and plausibility functions are 
charts of different types of statistical information. They are the three prominent 
instances of the concept of ods functions, due to Barnard (1949). An ods function 
is a real function on the space of possible experimental outcomes or on the space 
of hypotheses, which expresses the relative 'credibility' of the points of the space in 
question. It is often convenient to work with the logarithms of such functions and 
these are termed lods functions. For the objectives of this treatise the interest in 
lods (or ods) functions lies mainly in the very concept which is instrumental in 
bringing to the fore the duality between the sample aspect and the parameter 
aspect of statistical models, and in constructing prediction functions. Thus, 
although the concept of lods function will be referred to at a number of places, the 
theoretical developments relating to lods functions and presented in Barnard 
(1949) are not of direct relevance in the present context and will only be indicated 
briefly (in Section 3.1). 

Generally, only part of the statistical information contained in the model and 
the data is pertinent to a given question, and one is then faced with the problem of 
separating out that part. The key procedures for such separations are margining 
to a sufficient statistic and conditioning on an ancillary statistic. Basic here is the 
concept of nonformation, i.e. the concept that a certain submodel and the 
corresponding part of the data contain no (accessible) pertinent or relevant 
information in respect of the question of interest. 

A general treatment of the topics of statistical information indicated above is 
given in Part I, while the theory of exponential families is developed in Part III. 
Properties of convex sets and functions, in particular convex duality relations, are 
of great importance for the study of exponential families. Since much of convex 
analysis is of fairly recent origin and is not common knowledge, a compendious 
account of the relevant results is given in Part II, together with properties of 
unimodality and Laplace transforms. A reader primarily interested in lods 
functions and exponential families may concentrate on Chapters 2, 3, 8, and 9, 
just referring to Part II, which consists of Chapters 5-7, as need arises. Inferential 
separation, hereunder notably nonformation, ancillarity, and sufficiency, is 
discussed in Chapters 4 and 10. The chapters of Parts I and III contain 
Complements sections where miscellaneous results which did not fit into the 
mainstream of the text have been collected. 

Each known methodological approach, of any inclusiveness, to the questions 
of statistical inference is hampered by various difficulties of logical or epistemic 
character, and applications of these approaches must therefore be tempered by 
independent judgement. The merits of any one approach depend on the extent to 
which it yields sensible and useful answers as well as on the cogency of its 
fundamental ideas. 

The difficulties, of the kind mentioned, connected with likelihood, plausibility, 
ancillarity, and sufficiency have been discussed in Birnbaum (1969), 
Barndorff-Nielsen (1976b), and numerous other papers. Many of these papers 
will be referred to in the course of this treatise, but a comprehensive exposition 
of the arguments adduced will not be given. One of the difficulties, whose 
seriousness seems to have been overestimated, is that different applications 
of ancillarity and sufficiency, to the same model and data, may lead to different 
inferential conclusions (cf. Section 4.7(vi)). However, as has been stressed and well 
illustrated by Barnard (1974b), it is in general impossible to obtain unequivocal 
conclusions on the basis of statistical information. It is therefore not surprising 
that if uniqueness in conclusions is presupposed as a requirement of inference 
then paradoxical results turn up, such as is the case with Birnbaum’s Theorem 
(Section 4.7(v)). 


1.2 SOME MATHEMATICAL PREREQUISITES 

Let $M$ be a subset of a space $\mathfrak{M}$. The indicator of $M$ is the function $1_M$ defined by 

\[ 1_M(x) = \begin{cases} 1 & \text{for } x \in M \\ 0 & \text{for } x \in M^c \end{cases} \]

where $M^c$ is the complement $\mathfrak{M} \setminus M$ of $M$. If $\mathfrak{M}$ is a product space, 
$\mathfrak{M} = \mathfrak{M}_1 \times \mathfrak{M}_2$, and if $x_1 \in \mathfrak{M}_1$ then $M_{x_1}$ is the section of $M$ at $x_1$, 
i.e. $M_{x_1} = \{x_2 : (x_1, x_2) \in M\}$. 
When $\mathfrak{M}$ is a topological space the interior, closure, and boundary of $M$ are 
denoted by int $M$, cl $M$, and bd $M$, respectively. Suppose $\mathfrak{M} = R^k$. The affine hull 
of $M$ is written aff $M$, and dim $M$ is the dimension of aff $M$. An affine subset of $M$ is 
a set of the form $M \cap L$ where $L$ is an affine subspace of $R^k$. 

For any mapping $f$ the notations domain $f$ and range $f$ will be used, 
respectively, for the domain of definition of $f$ and the range of $f$, and $f$ is said to be 
a mapping on domain $f$. 

If $x$ is a real number then $[x\}$ will stand for $x - 1$ or $x$ provided $x$ is an integer 
and for $[x]$, the integer part of $x$, otherwise. Furthermore, the notations 
$N = \{1, 2, \ldots\}$, $N_0 = \{0, 1, 2, \ldots\}$, and $Z = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ are adopted. 

All vectors are considered basically as row vectors, and the length of a vector $x$ 
is indicated by $|x|$. A set of vectors in $R^k$ are said to be affinely independent 
provided their endpoints do not belong to an affine subspace of $R^k$. The transpose 
of a matrix $A$ is denoted by $A'$ and, for $A$ square, $|A|$ is the determinant and tr $A$ 
is the trace of $A$. The symbols $I$ or $I_r$ are used for the $r \times r$ unit matrix ($r = 1, 2, \ldots$). 
Occasionally an $r \times r$ symmetric matrix $A$ with elements $a_{ij}$, say, will be 
interpreted either as a point in $R^{r^2}$ or as the point in $R^{r(r+1)/2}$ whose coordinates are 
given by $(a_{11}, a_{22}, \ldots)$. Let $\Sigma$ be a positive 
definite matrix, set $\Delta = \Sigma^{-1}$, and let 

\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \quad \text{and} \quad \Delta = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix} \]

be similar partitions of $\Sigma$ and $\Delta$. Then, as is well known, 

\[ \Delta_{22}^{-1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \tag{1} \]

(2) = 
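Identity (1) can be checked exactly on a small case. The following is only an illustrative sketch: the matrix `sigma` is a hypothetical example, and the blocks are taken to be 1 x 1 so that plain rational arithmetic suffices.

```python
from fractions import Fraction


def inverse_2x2(m):
    """Exact inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical positive definite matrix Sigma, partitioned into 1 x 1 blocks.
sigma = [[Fraction(4), Fraction(1)], [Fraction(1), Fraction(3)]]
delta = inverse_2x2(sigma)  # Delta = Sigma^{-1}

# Left side of (1): the inverse of the block Delta_22.
lhs = 1 / delta[1][1]
# Right side of (1): Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12.
rhs = sigma[1][1] - sigma[1][0] * (1 / sigma[0][0]) * sigma[0][1]
```

Both sides equal 11/4 for this choice of `sigma`; the same computation with larger blocks would need a general matrix inverse in place of `inverse_2x2`.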

When indexed variables, as for example $x_i$, $i = 1, \ldots, m$, or $x_{ij}$, $i = 1, \ldots, m$, 
$j = 1, \ldots, n$, are considered the substitution of a dot for an index variable signifies 
summation over that variable. Furthermore, the vector $(x_1, \ldots, x_m)$ will be 
denoted by $x_*$, the vector $(x_{i1}, \ldots, x_{in})$ by $x_{i*}$, etc. 

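Consider a real-valued function $f$ defined on a subset $\mathfrak{X}$ of $R^k$. The notations 
$Df = \partial f/\partial x$ and $\partial^2 f/\partial x'\partial x$ are used for the gradient and the matrix of second 
order derivatives of $f$, respectively, while $D^i f$, where $i = (i_1, \ldots, i_k)$ is a vector of 
non-negative integers, indicates a mixed derivative of $f$. (Thus 
$Df = (D^{(1,0,\ldots,0)} f, \ldots, D^{(0,\ldots,0,1)} f)$.) In 
the case where a partition $(x^{(1)}, \ldots, x^{(m)})$ of $x$ ($\in \mathfrak{X}$) is given then the $(i, j)$th matrix 
component of the corresponding partition of $\partial^2 f/\partial x'\partial x$ is denoted by 
$\partial^2 f/\partial x^{(i)\prime}\partial x^{(j)}$. Let $h$ be a twice continuously differentiable mapping on an open 
subset $\mathfrak{Y}$ of $R^k$ and onto $\mathfrak{X}$, also assumed open, and set 

\[ \frac{\partial x}{\partial y} = \frac{\partial h}{\partial y} = \left( \frac{\partial h_j}{\partial y_i} \right), \]

the Jacobian matrix of $h$. Moreover, set 

\[ \frac{\partial^2 h}{\partial y'\partial y} = \left( \frac{\partial^2 h_1}{\partial y'\partial y}, \ldots, \frac{\partial^2 h_k}{\partial y'\partial y} \right). \]

If $f$ is twice continuously differentiable then, writing $\tilde f$ for the composition $f \circ h$ of $f$ 
and $h$, one has 

\[ \frac{\partial^2 \tilde f}{\partial y'\partial y} = \frac{\partial x}{\partial y'} \, \frac{\partial^2 f}{\partial x'\partial x} \, \frac{\partial x'}{\partial y} + \frac{\partial f}{\partial x} \bullet \frac{\partial^2 h}{\partial y'\partial y} \tag{3} \]

where $\bullet$ is a matrix multiplication symbol defined in the following way. For a $1 \times k$ 
vector $v = (v_1, \ldots, v_k)$ and an $m \times nk$ matrix $A = [A_1, \ldots, A_k]$, $A_\alpha$ being $m \times n$ 
($\alpha = 1, \ldots, k$), the product $v \bullet A$ is given by 

\[ v \bullet A = v_1 A_1 + \cdots + v_k A_k. \]

(Thus the operation $\bullet$ is a generalization of the ordinary inner product of two 
$k$-dimensional vectors.) 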
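The block product defined above is easy to state in code. The sketch below is illustrative only: the function name `bullet` and the representation of $A$ as a list of its $k$ blocks (each a nested list) are assumptions made for the example.

```python
def bullet(v, blocks):
    """Return v_1 A_1 + ... + v_k A_k for A = [A_1, ..., A_k].

    `v` is a sequence of k numbers and `blocks` a list of k matrices
    (nested lists) that all share the same m x n shape.
    """
    if len(v) != len(blocks):
        raise ValueError("need one block per component of v")
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [
        [sum(v_a * block[r][c] for v_a, block in zip(v, blocks))
         for c in range(cols)]
        for r in range(rows)
    ]


# Two 2 x 2 blocks combined with weights 2 and -1.
A1 = [[1, 0], [0, 1]]
A2 = [[0, 3], [3, 0]]
product = bullet([2, -1], [A1, A2])

# With 1 x 1 blocks the operation reduces to the ordinary inner product.
inner = bullet([1, 2, 3], [[[4]], [[5]], [[6]]])
```

In the chain rule for second derivatives the vector is the gradient of $f$ and the blocks are the second-derivative matrices of the coordinate functions of $h$.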

Measure-theoretic questions concerning null sets, measurability of mappings, 
etc., will largely be bypassed. (Section 4.2, however, forms something of an 
exception to this.) The mathematical gaps left thereby may be filled out by 
standard reasoning. 

Lebesgue measure will be denoted by $\lambda$, counting measure by $\nu$. (The domains 
of these measures vary from case to case but it will be apparent from the context 
what the domain is.) 

Let $H$ be a class of transformations on a space $\mathfrak{X}$, i.e. the elements of $H$ are 
one-to-one mappings of $\mathfrak{X}$ onto itself. The class $H$ is unitary, respectively transitive, if 
for every pair of points $x$ and $\tilde x$ in $\mathfrak{X}$ the equation $\tilde x = h(x)$ has at most, respectively 
at least, one solution $h$ in $H$. In the case where $H$ is transitive, the set 
$H(x) = \{h(x) : h \in H\}$ is equal to $\mathfrak{X}$. A measure $\mu$ on a $\sigma$-algebra $\mathfrak{A}$ of $\mathfrak{X}$ is 
transformation invariant under $H$ if $\mu h = \mu$ for every $h \in H$, where $\mu h$ is defined by 
$\mu h(A) = \mu(h^{-1}(A))$, $A \in \mathfrak{A}$. Suppose $H$ is a group under the operation $\circ$ of 
composition of mappings. Then $H(x)$ is called the orbit of $x$ and the orbits form a 
partition of $\mathfrak{X}$. If, in addition, $H$ is unitary then each orbit can be brought into 
one-to-one correspondence with $H$, and thus $\mathfrak{X}$ can be represented as a product space 
$\mathfrak{U} \times \mathfrak{V}$ of points $(u, v)$ such that $u$ determines the orbit and $v$ the position on that 
orbit of the point $x$ in $\mathfrak{X}$ corresponding to $(u, v)$. As is well known (see e.g. Nachbin 
1965), if $H$ is a locally compact, topological group then there exist left invariant as 
well as right invariant measures on $H$. For $H$ unitary and transitive, these 
measures can, by the above identification of $\mathfrak{X}$ and $H$, also be viewed as 
transformation invariant measures on $\mathfrak{X}$. 

The sample spaces to be considered are exclusively Euclidean, i.e. they are Borel 
subsets of Euclidean spaces, and the associated $\sigma$-algebras of events are the 
classes of Borel subsets of the sample spaces. Moreover, all random variables and 
statistics take values in Euclidean spaces. Generally, the letter $\mathfrak{X}$ will be used to 
denote the sample space, and $x$ is a point in $\mathfrak{X}$. 

Ordinarily, the same notation — a lower case italic letter — will be used for a 
random variable or statistic and for its value, the appropriate interpretation being 
determined by the context. In cases where clarity demands a distinction the 
mapping is denoted by the capital version of the letter. 

Let $\mathfrak{X}$ ($\subset R^k$) be a sample space, $\mathfrak{A}$ the $\sigma$-algebra of Borel subsets of $\mathfrak{X}$, and $\mathfrak{P}$ a 
family of probability measures on $\mathfrak{X}$. The triplet $(\mathfrak{X}, \mathfrak{A}, \mathfrak{P})$ is termed a statistical 
field. Let $P$ be a member of $\mathfrak{P}$ and let $t$ (also $T$) be a statistic. 

The marginal distribution of $t$ under $P$ has probability measure $P_t$ given by 
$P_t(B) = P(t^{-1}(B))$ for Borel sets $B$. Further, $E_P t$ and $V_P t$ stand for the mean value 
(vector) and the variance (matrix) of $t$. For an event $A$ with $P(A) > 0$ the 
conditional probability measure given $A$ is denoted by $P(\cdot \mid A)$ or $P^A$. If $\mathfrak{S}$ is a 
sub-$\sigma$-algebra of $\mathfrak{A}$ then $P_{\mathfrak{S}}$ denotes the restriction of $P$ to $\mathfrak{S}$ and $P^{\mathfrak{S}}$ is the Markov 
kernel of the conditional distribution given $\mathfrak{S}$ under $P$. The conditional mean 
value given $\mathfrak{S}$ under $P$ of a random variable $y$ is written $E_P^{\mathfrak{S}} y$. When $\mathfrak{S}$ is the 
$\sigma$-algebra generated by a statistic $t$ the notations $P_t$, $P^t$ or $P(\cdot \mid t)$, and $E_P^t y$ are 
normally used instead of $P_{\mathfrak{S}}$, $P^{\mathfrak{S}}$, and $E_P^{\mathfrak{S}} y$. For any measure $\mu$ on $\mathfrak{X}$, let $\mu^{(n)}$ indicate 
the measure on the product space $\mathfrak{X}^n$ which is the $n$-fold product of $\mu$ with itself, 
and let $\mu^{*n}$ be the $n$-fold convolution of $\mu$ (provided it exists). Set 
$\mathfrak{P}_t = \{P_t : P \in \mathfrak{P}\}$, $\mathfrak{P}^t = \{P^t : P \in \mathfrak{P}\}$, etc. 

If $P$ and $Q$ are two probability measures on $\mathfrak{X}$ having common support then 

\[ \frac{dP_t}{dQ_t} = E_Q^t \frac{dP}{dQ} \tag{4} \]

and 

\[ \frac{dP(\cdot \mid t)}{dQ(\cdot \mid t)} = \frac{dP/dQ}{dP_t/dQ_t}. \tag{5} \]
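For distributions on a finite sample space, relations (4) and (5) reduce to elementary identities between probability ratios. The sketch below verifies them exactly; the distributions `P` and `Q` and the statistic `t` are hypothetical choices made only for illustration.

```python
from fractions import Fraction as F

# Hypothetical distributions P and Q on the sample space {0, 1, 2, 3}.
P = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
Q = {0: F(1, 4), 1: F(1, 4), 2: F(1, 4), 3: F(1, 4)}


def t(x):
    """A statistic grouping the sample points into {0, 1} and {2, 3}."""
    return x // 2


def marginal(prob):
    """Distribution of t under `prob`."""
    out = {}
    for x, mass in prob.items():
        out[t(x)] = out.get(t(x), 0) + mass
    return out


P_t, Q_t = marginal(P), marginal(Q)


def cond_mean_ratio(b):
    """Conditional mean under Q of dP/dQ given t = b."""
    return sum(Q[x] / Q_t[b] * P[x] / Q[x] for x in Q if t(x) == b)


# (4): the density of P_t with respect to Q_t is the conditional mean of dP/dQ.
check_4 = all(P_t[b] / Q_t[b] == cond_mean_ratio(b) for b in Q_t)

# (5): the ratio of the conditional probabilities given t.
check_5 = all(
    (P[x] / P_t[t(x)]) / (Q[x] / Q_t[t(x)])
    == (P[x] / Q[x]) / (P_t[t(x)] / Q_t[t(x)])
    for x in P
)
```

Rational arithmetic is used so that the equalities hold exactly rather than up to rounding.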


A distribution on $R^k$ is singular if its affine support (i.e. the affine hull of its 
support) is a proper subset of $R^k$. Let $u$ and $v$ be statistics. The conditional 
distribution of $u$ given $v$ and under $P$ is singular provided that the marginal 
distribution of $u$ under the conditional probability measure given $v$ is singular 
with probability 1, i.e. 

\[ P\{P_u(\cdot \mid v) \text{ is singular}\} = 1. \]

The probability measure $P$ is said to be of discrete type if the support $S$ of $P$ has 
no accumulation points, of c-discrete type if $S$ equals the intersection of the set 
$Z^k$ and some convex set, and of continuous type if $P$ is absolutely continuous with 
respect to Lebesgue measure $\lambda$ on $\mathfrak{X}$. The same terms are applied to $\mathfrak{P}$ provided 
each member of $\mathfrak{P}$ has the property in question. 

A function $\psi$ defined on $\mathfrak{P}$ and taking values in some Euclidean space is called a 
parameter function. As with random variables and statistics, the same notation $\psi$ 
will generally be used for the function and its value, but when it seems necessary to 
distinguish explicitly the function is indicated by $\psi(\cdot)$. Suppose $\mathfrak{P}$ is given as an 
indexed set, $\mathfrak{P} = \{P_\omega : \omega \in \Omega\}$, then $\mathfrak{P}$ is called parametrized provided $\Omega$ is a subset 
of a Euclidean space and the mapping $\omega \to P_\omega$ is one-to-one. Any parameter 
function $\psi$ on $\mathfrak{P}$ may be viewed as a function of $\omega$, and its values will be denoted, 
freely, by $\psi(\omega)$ as well as by $\psi$ or $\psi(P_\omega)$. Similarly for other kinds of mappings. 

The family $\mathfrak{P}$ is said to be generated by a class $H$ of transformations on $\mathfrak{X}$ if for 
some member $P$ of $\mathfrak{P}$ one has $\mathfrak{P} = \{Ph : h \in H\}$. In the case where $H$ is a unitary 
group the family $\mathfrak{P}$ will be called a group family. Suppose that $\mathfrak{P}$ is a group family 
and that $u$ is a statistic which is constant on the orbits under $H$ but takes different 
values on different orbits (thus $u$ is a maximal invariant). Then $u$ is said to index 
the orbits, and the marginal distribution of $u$ is the same under all the elements of 
$\mathfrak{P}$. It is also to be noticed that if $\mathfrak{P}$ is a transitive group family (i.e. a group family 
with $H$ transitive) and if $\mu$ is a left or right invariant measure on $H$ which, when 
interpreted as a transformation invariant measure on $\mathfrak{X}$, dominates $\mathfrak{P}$ then the 
family $\mathfrak{p}$ of probability functions or densities of $\mathfrak{P}$ relative to $\mu$ is of the form 

\[ \mathfrak{p} = \{p(h^{-1}(\cdot)) : h \in H\} \tag{6} \]

where $p = dP/d\mu$. 

For the discussions in Parts I and III (except Section 3.1) it is presupposed that 
a statistical model, with sample space $\mathfrak{X}$ and family of probability measures $\mathfrak{P}$, has 
been formulated. Unless explicitly stated otherwise, it is moreover supposed that 
$\mathfrak{P}$ is parametrized, $\mathfrak{P} = \{P_\omega : \omega \in \Omega\}$, and determined by a family of probability 
functions $\mathfrak{p} = \{p(\cdot\,; \omega) : \omega \in \Omega\}$, i.e. $p(\cdot\,; \omega)$ is the density of $P_\omega$ with respect to a 
certain $\sigma$-finite measure $\mu$ which dominates $\mathfrak{P}$. For discrete-type distributions this 
dominating measure is always taken to be counting measure, so that $p(x; \omega)$ is the 
probability of $x$. (In subsequent chapters certain topics in plausibility inference 
will be considered. Whenever this is the case, it is, for non-discrete 
distributions, presupposed that $\sup_x p(x; \omega) < \infty$ for every $\omega \in \Omega$.) In the case $\mathfrak{p}$ 
is of the form (6), for some probability function $p$ with respect to $\mu$ and some class 
$H$ of transformations on $\mathfrak{X}$, then $\mathfrak{p}$ is said to be generated by $H$. The points $x$ in $\mathfrak{X}$ 
for which $p(x; \omega) > 0$ for some $\omega \in \Omega$ are called realizable, and the realizable 
values of a statistic are the values corresponding to realizable sample points $x$. 
Viewed as a function on $\mathfrak{X} \times \Omega$, $p(\cdot\,; \cdot)$ is referred to as the model function. The 
notation $p(u; \omega \mid t)$ is used for the value of the conditional probability function of a 
statistic $u$ given $t$ and under $\omega$. 

From the previous discussion it is apparent that if the parametrized family 
$\mathfrak{P} = \{P_\omega : \omega \in \Omega\}$ is a group family under a group $H$ of transformations on the 
sample space $\mathfrak{X}$ then, under mild regularity assumptions, $\mathfrak{X}$ can be viewed as a product 
space $\mathfrak{U} \times \mathfrak{V}$, the spaces $\mathfrak{V}$, $H$, and $\Omega$ may be identified, and $\mathfrak{P}$ has a model 
function of the form 

\[ p(x; \omega) = p(u) \, p(\omega^{-1}(v) \mid u) \]

in an obvious notation. 

The $r$-dimensional normal distribution with mean (vector) $\xi$ and variance 
(matrix) $\Sigma$ will be indicated by $N_r(\xi, \Sigma)$, and $\mathfrak{N}_r$ will stand for the class of these 
distributions. (The index $r$ will be suppressed when $r = 1$.) The precision (matrix) 
$\Delta$ for $N_r(\xi, \Sigma)$ is the inverse of the variance, i.e. $\Delta = \Sigma^{-1}$. The probability measure 
of $N_r(\xi, \Sigma)$ will be denoted by $P_{(\xi, \Sigma)}$ or $P_{(\xi, \Delta)}$ according as the parametrization of $\mathfrak{N}_r$ 
by $(\xi, \Sigma)$ or by $(\xi, \Delta)$ is the one of interest. 

The symbol ► designates the end of proofs and examples. 


1.3 PARAMETRIC MODELS 

The statistical models considered in this tract are nearly all parametric and 
determined by a model function $p(x; \omega)$. Rather more attention than is usual will 
be given to the parametric aspect of the models, i.e. to the variation domains of the 
parameters and subparameters involved and to the structure of $p(x; \omega)$ as a 
function of $\omega$. Thus the observation aspect and the parameter aspect are treated 
on a fairly equal footing. There are several reasons for this. The most substantial 
is that the logic of inferential separation cannot be built without certain precise 
specifications of the role of the parameters. Secondly, it is natural in connection 
with a comparative discussion of likelihood functions and plausibility functions 
to give an exposition of Barnard’s theory of lods functions, and in a considerable 
and fundamental portion of that theory observations and parameters occur in a 
formally equivalent, or completely dual, way. Finally, the stressing of the 
similarity or duality of the observation and parameter aspects, as far as is 
statistically meaningful, leads to a certain unification and complementation of 
the theoretical developments. 

There are two main classes of parametric models: the exponential families and 
the group families. The exponential families, the exact theory of which is a main 
topic of this book, are determined by model functions of form 

\[ p(x; \omega) = a(\omega) \, b(x) \, e^{\theta \cdot t} \]

where $\theta$ is a $k$-dimensional parameter (function) and $t$ is a $k$-dimensional statistic. 
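As a concrete instance of this form with $k = 1$, the binomial probability function can be written with $a = (1 - \pi)^n$, $b(x) = \binom{n}{x}$, $t(x) = x$ and $\theta = \log\{\pi/(1 - \pi)\}$. The sketch below compares the two expressions numerically; the function names are chosen for the example only.

```python
import math


def binomial_pmf(x, n, pi):
    """Binomial probability of x successes in n trials."""
    return math.comb(n, x) * pi**x * (1 - pi) ** (n - x)


def binomial_exponential_form(x, n, pi):
    """The same probability written as a(omega) b(x) exp(theta * t)."""
    a = (1 - pi) ** n                  # factor depending on the parameter only
    b = math.comb(n, x)                # factor depending on the observation only
    theta = math.log(pi / (1 - pi))    # canonical parameter (log odds)
    return a * b * math.exp(theta * x)


agree = all(
    math.isclose(binomial_pmf(x, 6, pi), binomial_exponential_form(x, 6, pi))
    for x in range(7)
    for pi in (0.1, 0.35, 0.5, 0.8)
)
```

The comparison uses `math.isclose` because the two expressions are evaluated in floating point and agree only up to rounding.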




Group families typically have model functions which may be written 

\[ p(x; \omega) = p(u) \, p(\omega^{-1}(v) \mid u), \]

as explained in Section 1.2. A theory of group families, the theory of structural 
inference, has been developed by Fraser (1968, 1976) (see also Dawid, Stone and 
Zidek 1973) from Fisher's ideas on fiducial inference. Although the core of 
fiducial/structural inference is a notion of induced probability distributions for 
parameters which few persons have found acceptable, the theory comprises many 
results that are highly useful in the handling of group families along more 
conventional lines. 

The overlap between the two classes of families is very little; thus in the case $\omega$ is 
one-dimensional, the only notable instances of families which belong to both 
classes appear to be provided by the normal distributions with a known variance 
and the gamma distributions with a known shape parameter (cf. Lindley 1958, 
Pfanzagl 1972, and Hipp 1975). Moreover, essential distinctions exist between the 
mathematical-statistical analyses which are appropriate for each of the two 
classes. It is remarkable indeed that both classes and the basic difference in their 
nature were first indicated in a single paper by Fisher (1934). 

Each class covers a multitude of important statistical models and allows for a 
powerful general theory. This strongly motivates studying these classes per se and 
choosing the model for a given data set from one of the two classes, when feasible. 
Once this is realized, it seems of secondary interest only that one may be led to 
consider, for instance, exponential families by arguing from various viewpoints of 
a principled character, such as sufficiency, maximum likelihood, statistical 
mechanics, etc. (see the references in Section 8.4), especially since each of these 
viewpoints and its consequences only encompass a fraction of what is of 
importance in statistics. 



PART I 

Lods Functions and Inferential Separation 


Log-probability functions, log-likelihood functions and log-plausibility functions 
are the three main instances of lods functions. It is an essential feature of the 
theory of lods functions that it incorporates a considerable part of the statistically 
relevant duality relations which exist between the sample aspect and the 
parameter aspect of statistical models. 

Separate inference is inference on a parameter of interest from a part of the 
original model and data. Margining to a sufficient statistic and conditioning on 
an ancillary statistic are key procedures for inferential separation. 




CHAPTER 2 

Likelihood and Plausibility 


In this short chapter important basic properties of likelihood functions and 
plausibility functions are discussed, with particular reference to similarities and 
differences between these two kinds of function. As a preliminary, the definition 
and some properties of universality are presented. Universality will also be of 
significance in the discussions, in subsequent chapters, of prediction, inferential 
separation, and unimodality. 


2.1 UNIVERSALITY 

The concept of universality is of significance in the discussions, given later in the 
book, on plausibility, M-ancillarity, prediction and unimodality. 

The probability function $p(\cdot\,; \omega)$ is said to have a point $x$ as mode point if 

\[ p(x; \omega) = \sup_x p(x; \omega), \]

and the set of mode points of $p(\cdot\,; \omega)$ will be denoted by $\check x(\omega)$. More generally, $x$ 
will be called a mode point for the family $\mathfrak{p}$ provided that for all $\varepsilon > 0$ there exists 
an $\omega \in \Omega$ such that 

\[ (1 + \varepsilon) \, p(x; \omega) > \sup_x p(x; \omega). \tag{1} \]

With this designation universality of the family $\mathfrak{p}$ is defined as the property that 
every realizable $x$ is a mode point for $\mathfrak{p}$. If, in fact, every realizable $x$ is a mode 
point for some member of $\mathfrak{p}$ then $\mathfrak{p}$ is called strictly universal. 

For convenience in formulation, universality and strict universality will 
occasionally be spoken of as if they were possible properties of the family of 
probability measures $\mathfrak{P}$ rather than of $\mathfrak{p}$. Thus, for instance, '$\mathfrak{P}$ is universal' will 
mean that the family $\mathfrak{p}$ of probability functions determining $\mathfrak{P}$ is universal. 

A family $\mathfrak{p}$ for which $\sup_x p(x; \omega)$ is independent of $\omega$ will be said to have 
constant mode size. 
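Strict universality can be checked directly for the binomial family with a fixed trial number. The sketch below assumes $n = 5$ and, for each $x$, picks as a hypothetical witness the parameter value at the midpoint of the interval $[x/(n+1), (x+1)/(n+1)]$, on which $x$ is a mode of the binomial distribution.

```python
from fractions import Fraction
from math import comb


def pmf(x, n, pi):
    """Binomial probability function p(x; pi)."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


def is_mode_point(x, n, pi):
    """True if x maximizes p(.; pi) over the sample space {0, ..., n}."""
    top = max(pmf(y, n, pi) for y in range(n + 1))
    return pmf(x, n, pi) == top


n = 5
# Hypothetical witness for each x: the midpoint (2x + 1) / (2(n + 1)).
witness = {x: Fraction(2 * x + 1, 2 * (n + 1)) for x in range(n + 1)}
strictly_universal = all(is_mode_point(x, n, witness[x]) for x in range(n + 1))

# The mode size sup_x p(x; pi) varies with pi, so it is not constant here.
mode_sizes = {max(pmf(y, n, pi) for y in range(n + 1)) for pi in witness.values()}
```

Every sample point is a mode point for some member of the family, so the family is strictly universal, while `mode_sizes` contains more than one value.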

Most of the standard families of densities are universal, and many examples of 
universal families will be mentioned later on. Clearly, one has: 


Lemma 2.1. Let $H$ be a class of transformations on $\mathfrak{X}$. Suppose $H$ is transitive and 
that for some $\omega_0 \in \Omega$ 

\[ \mathfrak{p} = \{p(h^{-1}(\cdot)\,; \omega_0) : h \in H\}. \]

Then $\mathfrak{p}$ is universal with constant mode size. 

As a simple consequence of the definition of mode point one finds: 

Theorem 2.1. Let $t$ be a statistic and let 

\[ p(x; \omega) = p(t; \omega) \, p(x; \omega \mid t) \]

be the factorization of the probability function of $x$ into the marginal probability 
function for $t$ and the conditional probability function for $x$ given $t$. 

Suppose $x_0$ is a mode point of $\mathfrak{p}$ and set $t_0 = t(x_0)$. Then $x_0$ is a mode point of the 
family of conditional probability functions 

\[ \{p(\cdot\,; \omega \mid t_0) : \omega \in \Omega\}. \]

Corollary 2.1. If $\mathfrak{p}$ is universal then for any given value of $t$ the family of conditional 
probability functions 

\[ \{p(\cdot\,; \omega \mid t) : \omega \in \Omega\} \]

is also universal. 

Furthermore, it is trivial that if $\mathfrak{p}$ has only a single member $p$, say, then $\mathfrak{p}$ is 
universal if and only if $p$ is constant, i.e. the density is uniform. 

The family $\mathfrak{p}$ will be said to distinguish between the values of $x$ if for every pair $x'$ 
and $x''$ of values of $x$ there exists an $\omega \in \Omega$ such that 

\[ p(x'; \omega) \neq p(x''; \omega). \]

If $\mathfrak{p}$ is universal and distinguishes between the values of $x$ then, under very mild 
regularity conditions, $x$ is minimal sufficient. To see this, let $x'$ and $x''$ be realizable 
points of $\mathfrak{X}$ and suppose that 

\[ c' p(x'; \omega) = c'' p(x''; \omega) \quad \text{for every } \omega \in \Omega \]

where $c'$ and $c''$ are constants (which may depend, respectively, on $x'$ and $x''$). By 
the universality of $\mathfrak{p}$, the ratio $c''/c'$ must be 1, and this implies $x' = x''$ since $\mathfrak{p}$ 
distinguishes between values of $x$. In other words, the partition of $\mathfrak{X}$ induced by 
the likelihood function is (equivalent to) the full partition into single points; the 
result now follows from Corollary 4.3. 


2.2 LIKELIHOOD FUNCTIONS AND PLAUSIBILITY FUNCTIONS 

A brief, comparative discussion of basic properties of likelihood and plausibility 
functions is given here. 

Both likelihood and plausibility functions are considered as determined only 
up to a factor which does not depend on the parameter of the model. However, 
unless otherwise stated, the notations $L$ and $\Pi$ will stand for the particular 
choices 

\[ L(\omega) = L(\omega; x) = p(x; \omega) \]
\[ \Pi(\omega) = \Pi(\omega; x) = p(x; \omega) / \sup_x p(x; \omega) \]

of the likelihood and plausibility functions, based on an observation $x$. 

It is important to note that $L$ and $\Pi$ differ only by a factor 

\[ s(\omega) = \sup_x p(x; \omega) \]

which is independent of $x$. 

This implies that $\ln \Pi(\omega; x)$, as well as $\ln L(\omega; x)$, is a b-lods function 
corresponding to the f-lods function $\ln p(x; \omega)$, in the terminology of Barnard's 
(1949) fundamental theory of lods. Some important common properties of 
likelihood and plausibility functions may be derived naturally in the theory of 
lods functions (see Section 3.2). 

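The normed likelihood and plausibility functions will be denoted by $\bar L$ and $\bar\Pi$, 
i.e. 

\[ \bar L(\omega) = L(\omega) / \sup_\omega L(\omega) \]
\[ \bar\Pi(\omega) = \Pi(\omega) / \sup_\omega \Pi(\omega). \]

Clearly, 

\[ \sup_\omega \Pi(\omega; x) = 1 \]

if and only if $x$ is a mode point of $\mathfrak{p}$. Hence, $\bar\Pi = \Pi$ for every $x$ if and only if $\mathfrak{p}$ is 
universal. 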

For any family $\mathfrak{p}$ such that $\sup_\omega p(x; \omega) < \infty$ for every $x$ one has 

\[ \bar L(\omega; x) = s(\omega) \, r(x) \, \bar\Pi(\omega; x) \]

where 

\[ r(x) = \sup_\omega \Pi(\omega; x) / \sup_\omega p(x; \omega). \]

If $\bar L$ and $\bar\Pi$ are equal for a given value of $x$ then $s(\omega)$ must be independent of $\omega$ on 
the set $\{\omega : p(x; \omega) > 0\}$ which means that $\mathfrak{p}$ has constant mode size on that set. 
On the other hand, constant mode size of $\mathfrak{p}$ obviously implies that $\bar L = \bar\Pi$ for 
every $x$. In particular, $\bar L$ and $\bar\Pi$ are thus equal for every $x$ if $\mathfrak{p}$ is generated by a 
transitive set of transformations. 
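The relation between the normed likelihood and plausibility functions can be verified exactly on a finite model. The sketch below uses a hypothetical family with three parameter values and three sample points; the table `p` and all function names are assumptions made for illustration.

```python
from fractions import Fraction as F

# Hypothetical model: keys are parameter values, list entries are the
# probabilities of the sample points 0, 1, 2.
p = {
    "a": [F(1, 2), F(1, 4), F(1, 4)],
    "b": [F(1, 5), F(1, 5), F(3, 5)],
    "c": [F(1, 3), F(1, 3), F(1, 3)],
}
omegas = list(p)
xs = range(3)

# Mode size s(omega) = sup over x of p(x; omega).
s = {w: max(p[w]) for w in omegas}


def lik(w, x):
    """Likelihood L(omega; x) = p(x; omega)."""
    return p[w][x]


def plaus(w, x):
    """Plausibility: p(x; omega) divided by the mode size s(omega)."""
    return p[w][x] / s[w]


def normed(func, w, x):
    """Divide func(.; x) by its supremum over the parameter values."""
    return func(w, x) / max(func(v, x) for v in omegas)


def r(x):
    """Factor r(x) = sup_omega plausibility / sup_omega p(x; omega)."""
    return max(plaus(v, x) for v in omegas) / max(lik(v, x) for v in omegas)


identity_holds = all(
    normed(lik, w, x) == s[w] * r(x) * normed(plaus, w, x)
    for w in omegas
    for x in xs
)
# Every x is a mode point of some member, so the family is universal.
universal = all(max(plaus(v, x) for v in omegas) == 1 for x in xs)
```

The mode size differs between the three members, and correspondingly the normed likelihood and plausibility differ, for example at the parameter value `"b"` and observation 0.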

The set of maximum points of the likelihood or plausibility function constitute 
respectively the maximum likelihood estimate $\hat\omega(x)$ and the maximum plausibility 
estimate $\check\omega(x)$ of the parameter $\omega$, i.e. 

\[ \hat\omega = \hat\omega(x) = \{\omega : L(\omega) = \sup_\omega L(\omega)\} \]
\[ \check\omega = \check\omega(x) = \{\omega : \Pi(\omega) = \sup_\omega \Pi(\omega)\}. \]

CD 

Example 2.1. Figure 2.1 shows the normed likelihood and plausibility functions 
for the binomial model 

\[ p(x; \pi) = \binom{n}{x} \pi^x (1 - \pi)^{n-x} \tag{1} \]

when $n$ equals 1 or 3 and $x = 1$. 

The most prominent difference between $\bar L$ and $\bar\Pi$ ($= \Pi$) in the two cases is that 
$\bar L$ takes its maximum at one point only, $\pi = 1$ respectively $\tfrac{1}{3}$, while $\bar\Pi$, as is typical 
with discrete models, is 1 on a whole set, $\pi \in [\tfrac{1}{2}, 1]$ respectively $[\tfrac{1}{4}, \tfrac{1}{2}]$. ► 

Figure 2.1 The (normed) likelihood and plausibility functions 
corresponding to the observation x = 1 of a binomial variate with 
trial number n = 1 or 3 

The plausibility function is not, in contrast to the likelihood function, 
independent of the stopping rule. This is illustrated by the following example. 
Example 2.2. If $x$ follows a binomial distribution with $n = 2$ and if $x$ is observed to 
be 0 then the plausibility function is given by 

\[ \Pi(\pi) = \begin{cases} 1 & \text{for } 0 < \pi \le \tfrac{1}{3} \\[4pt] \dfrac{1 - \pi}{2\pi} & \text{for } \tfrac{1}{3} < \pi \le \tfrac{2}{3} \\[4pt] \left( \dfrac{1 - \pi}{\pi} \right)^2 & \text{for } \tfrac{2}{3} < \pi < 1. \end{cases} \]

Suppose, on the other hand, that $x$ has the negative binomial distribution 

\[ p(x; \pi) = (x + 1)(1 - \pi)^2 \pi^x \]

and that, again, $x = 0$. Then 

\[ \Pi(\pi) = (s + 1)^{-1} \pi^{-s} \quad \text{for } \frac{s}{s+1} \le \pi < \frac{s+1}{s+2}, \quad s = 0, 1, 2, \ldots \]

However, in both instances 

\[ L(\pi) = (1 - \pi)^2. \] ► 
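The dependence on the stopping rule can be seen numerically at a single parameter value. The sketch below evaluates both plausibilities and both likelihoods at $\pi = 3/4$ for the observation $x = 0$; the helper names and the truncation of the infinite negative binomial support are assumptions of the example.

```python
from fractions import Fraction as F
from math import comb


def plausibility(pmf, x, support):
    """p(x; pi) divided by the largest probability over `support`."""
    return pmf(x) / max(pmf(y) for y in support)


def binomial_pmf(pi, n=2):
    """Binomial probability function with trial number n."""
    return lambda x: comb(n, x) * pi**x * (1 - pi) ** (n - x)


def negative_binomial_pmf(pi):
    """The negative binomial probability function (x + 1)(1 - pi)^2 pi^x."""
    return lambda x: (x + 1) * (1 - pi) ** 2 * pi**x


pi = F(3, 4)
plaus_binomial = plausibility(binomial_pmf(pi), 0, range(3))
# The negative binomial support is infinite; truncating it at 200 is an
# assumption that is harmless here because the mode for pi = 3/4 is at most 3.
plaus_negbin = plausibility(negative_binomial_pmf(pi), 0, range(200))

lik_binomial = binomial_pmf(pi)(0)
lik_negbin = negative_binomial_pmf(pi)(0)
```

At $\pi = 3/4$ the binomial plausibility is $((1 - \pi)/\pi)^2 = 1/9$, the negative binomial plausibility is $(s + 1)^{-1} \pi^{-s} = 16/27$ with $s = 3$, and both likelihoods equal $(1 - \pi)^2 = 1/16$.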

Note that the relation $x \in \check x(\omega)$ entails $\omega \in \check\omega(x)$, and that the converse 
implication is true if (and only if) $x$ is a mode point for $\mathfrak{p}$. 

If the maximum plausibility estimates corresponding to two different values, $x'$ 
and $x''$, of the variate $x$ have a point in common then $c' p(x'; \omega) = c'' p(x''; \omega)$ where 
$c' = 1/\sup_\omega \Pi(\omega; x')$ and $c'' = 1/\sup_\omega \Pi(\omega; x'')$. Thus, provided $x$ is minimal 
sufficient with respect to $\{p(\cdot\,; \omega) : \omega \in \Omega_0\}$ for any open subset $\Omega_0$ of $\Omega$, the 
estimates $\check\omega(x')$ and $\check\omega(x'')$ will, in general, have at most boundary points in 
common (cf. Corollary 4.3). 

For discrete data, the set of possible values of the maximum likelihood
estimator is nearly always a proper, and small, subset of $\Omega$. Thus, for instance, if x
follows the binomial distribution (1) then $\hat{\pi}(\mathfrak{X}) = \{i/n: i = 1, 2, \dots, n - 1\}$ while
the domain of $\pi$ is (0, 1). In other words, the likelihood approach has the feature
that ordinarily, for discrete models, most of the parameter values are considered
not to give the best explanation of the observation x for any value of the latter. In
contrast, $\check{\omega}(\mathfrak{X})$ does, whenever $\mathfrak{X}$ is finite and otherwise as a rule, equal $\Omega$.
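
The contrast may be made concrete for the binomial model. The sketch below is an illustration under stated assumptions rather than a construction from the text: the helper names are hypothetical, and it uses the elementary fact that x is a mode of the binomial distribution with trial number n exactly when x <= (n + 1) pi <= x + 1, so that the plausibility of x equals 1 on that interval.

```python
from fractions import Fraction

def mle_values(n):
    # Values x/n of the maximum likelihood estimate that lie in (0, 1).
    return [Fraction(i, n) for i in range(1, n)]

def max_plausibility_interval(x, n):
    # End points of the set of pi for which x is a mode of binomial(n, pi).
    return Fraction(x, n + 1), Fraction(x + 1, n + 1)

n = 4
intervals = [max_plausibility_interval(x, n) for x in range(n + 1)]
# Consecutive intervals share an end point, so together they cover [0, 1].
covers_unit_interval = (
    intervals[0][0] == 0
    and intervals[-1][1] == 1
    and all(intervals[i][1] == intervals[i + 1][0] for i in range(n))
)
```

For n = 4 the maximum likelihood estimate takes only three values in (0, 1), whereas the five maximum plausibility estimates are intervals that jointly exhaust the parameter domain.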

The maximum plausibility estimator for a truncated model is often a simple 
modification of the estimator for the full model. 


Example 2.3. Let $\check{\pi}$ and $\check{\pi}_0$ denote, respectively, the maximum plausibility
estimators for the binomial model and its zero-truncation

    $\{1 - (1 - \pi)^{n}\}^{-1} \binom{n}{x} \pi^{x} (1 - \pi)^{n - x}$,   $x = 1, \dots, n$.

Here $\check{\pi}_0(x) = \check{\pi}(x)$ for x > 1 and $\check{\pi}_0(1) = \check{\pi}(0) \cup \check{\pi}(1)$. ►

The example also illustrates that $\check{\omega}$ may have a simple explicit expression in
cases where no such expression exists for $\hat{\omega}$. A further illustration of this is
provided by the r x c contingency table with both marginals given, cf. Example
4.19.


2.3 COMPLEMENTS 

(i) Finucan (1964) has given the following characterization of mode(s) of the
multinomial distribution

(1)    $p(x; \pi) = \frac{n!}{x_1! \cdots x_k!} \pi_1^{x_1} \cdots \pi_k^{x_k}$.

Let $\tilde{n}$ denote a non-negative real number. Any point $([\tilde{n}\pi_1\}, \dots, [\tilde{n}\pi_k\})$, for
which $[\tilde{n}\pi_1\} + \dots + [\tilde{n}\pi_k\} = n$, is a mode of the multinomial distribution (1), and
such a point exists. If none of the numbers $\tilde{n}\pi_1, \dots, \tilde{n}\pi_k$ is an integer then the mode
is unique.

A proof of this result will be given in Example 4.17. 
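
A small numerical check of this characterization is sketched below. It rests on assumptions that are not taken from the text: the bracket is read as the ordinary integer part whenever its argument is not an integer, the auxiliary real number is searched for on a grid of step 1/100 above n, and all function names are hypothetical. The modes are found independently by brute force over the whole sample space.

```python
from fractions import Fraction
from itertools import product
from math import factorial, floor

def multinomial_pmf(x, pi):
    # Exact multinomial point probability with n = sum(x).
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)
    prob = Fraction(coef)
    for xi, p in zip(x, pi):
        prob *= p**xi
    return prob

def brute_force_modes(n, pi):
    points = [x for x in product(range(n + 1), repeat=len(pi)) if sum(x) == n]
    best = max(multinomial_pmf(x, pi) for x in points)
    return [x for x in points if multinomial_pmf(x, pi) == best]

def floor_candidate(n, pi, steps=100):
    # Search for a real number whose products with the pi_i have
    # non-integer values and integer parts summing to n.
    for j in range(len(pi) * steps + 1):
        n_tilde = n + Fraction(j, steps)
        parts = [n_tilde * p for p in pi]
        if any(q.denominator == 1 for q in parts):
            continue
        x = tuple(floor(q) for q in parts)
        if sum(x) == n:
            return x
    return None

pi = (Fraction(1, 5), Fraction(3, 10), Fraction(1, 2))
modes = brute_force_modes(7, pi)
candidate = floor_candidate(7, pi)
```

In this instance the brute-force search returns a single mode, and it coincides with the point obtained from the integer parts.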

(ii) Suppose $\mathfrak{P}$ is a transitive group family (cf. Section 1.1). For any x the normed
likelihood and plausibility functions L and $\Pi$ are then identical. Moreover, for
any constant d the set $\{\omega: L(\omega) \ge d\} = \{\omega: \Pi(\omega) \ge d\}$ is a confidence set for $\omega$
whose confidence coefficient is given by the integral of $L/c = \Pi/c$ over the set in
question and with respect to right invariant measure, c being the norming
constant which makes the integral over all of $\Omega$ equal to 1.

This is a particular instance of the following result. If a fixed subset A of $\mathfrak{X} = \Omega$
is, in repeated sampling, transformed by each observation x into $xA^{-1}$ then the
frequency of cases in which $xA^{-1}$ contains the actual value of $\omega$ will tend to P(A)
as the sample size increases to infinity, and

    $P(A) = \int_{A} p \, d\mu$

may be rewritten as

    $P(A) = \Delta(x) \int_{xA^{-1}} L(\cdot; x) \, d\nu$

where $\nu$ denotes the right invariant measure on $\Omega$ and $\Delta(x)$ is a norming constant
(actually, $\Delta(\cdot)$ is the so-called modular function, which is determined by
$\mu(dh) = \Delta(h)\nu(dh)$).


2.4 NOTES 

It seems superfluous to make any bibliographical notes on likelihood here, except 
perhaps to draw the historically interested reader’s attention to the account by 
Edwards (1974) of the origination of likelihood in Fisher’s writings and of earlier
related ideas. 

The concepts of universality and plausibility were introduced in 
Barndorff-Nielsen (1973a) and (1976b) respectively, and the material of the 
present chapter has been taken largely from the latter paper. 

The inverse of the mode mapping $\check{x}$, which to any observed x assigns the values
of $\omega$ such that $p(\cdot; \omega)$ has x as mode, was propounded as an estimator of $\omega$ by
Hoglund (1974). This estimator was termed the exact estimator by Hoglund and
the maximum ordinate estimator in Barndorff-Nielsen (1976b). It may be noted
that when the maximum ordinate estimate exists then it equals the maximum 
plausibility estimate. Thus, in particular, for universal families the maximum 
ordinate and the maximum plausibility estimators are identical. 




CHAPTER 3 


Sample-Hypothesis Duality and 
Lods Functions


Statistical models $(\mathfrak{X}, p(\cdot\,; \cdot), \Omega)$ have a sample aspect and a hypothesis aspect. When
considering the sample aspect one thinks of the family $\{p(\cdot; \omega): \omega \in \Omega\}$ of
probability functions on $\mathfrak{X}$ and of the probabilistic properties of x embodied in
this family. The hypothesis aspect concerns the evidence on $\omega$ contained in the
various possible observations x, and embodiments of this evidence are given by
the family $\{p(x; \cdot): x \in \mathfrak{X}\}$ of likelihood functions and the family
$\{p(x; \cdot)/\sup_{x} p(x; \cdot): x \in \mathfrak{X}\}$ of plausibility functions.

There is a certain degree of duality between the sample aspect and the 
hypothesis aspect in that various notions and results in either aspect have, partly 
or completely, analogous counterparts in the other. This is the sample-hypothesis 
duality. 

To take a primitive example, one is interested in the ‘position’ and ‘shape’ of 
likelihood and plausibility functions as well as of probability functions, and the 
position, for instance, is often indicated by the maximum point or set of maximum 
points for the function in question. Another primitive example is provided by the 
vector-valued functions on $\mathfrak{X}$ and $\Omega$, i.e. respectively the statistics and the
subparameters. The indicator functions, in particular, correspond to events and 
hypotheses. 

More interestingly, stochastic independence has a significant analogue in the 
concept of likelihood independence which is discussed in Section 3.3. 

The extent of the duality varies in some measure with the type of model 
function. For linear exponential model functions 

    $p(x; \omega) = a(\omega) b(x) e^{\omega \cdot x}$

the duality is particularly rich in structure, as will become apparent in Part III. 
(Obviously, this form of model function in itself strongly invites mathematical 
duality considerations. Moreover, it so happens that for exponential models the 
mathematical theory of convex duality, summarized in Part II, fits closely with 
the sample-hypothesis duality.) 


A study of how far the sample-hypothesis duality goes, or could be brought to
go, at a basic, axiomatic level was made by Barnard (1949); see also Barnard
(1972a, 1974a). A brief account of the theory he developed is given in Section 3.1,
with the emphasis on his concept of lods functions, particular instances of which
are log-probability functions, and log-likelihood and log-plausibility functions.

By a suitable combination of lods functions it is possible to obtain various 
types of prediction functions, as will be discussed in Section 3.2. These prediction 
functions have a role in predictive inference which is similar to the role of log- 
likelihood and log-plausibility functions in parametric inference. 


3.1 LODS FUNCTIONS 

Barnard (1949) established an axiomatic system which he proposed as a basic 
part of statistical inference theory; see also Barnard (1972a, 1974a). The notions 
and basic properties of probability (and likelihood) are not presupposed for this 
system, and the addition and multiplication rules satisfied by probabilities are 
not used as axioms, but are derived at a fairly late stage of the development. In the 
earlier stages the so-called lods functions are introduced and studied. Log- 
probability, log-likelihood, and log-plausibility functions are the three sub- 
stantial examples of this kind of function. It is with the theory of lods functions 
that we shall be concerned here. 

The theory deals with pairs of sets, $\mathfrak{X}$ and $\Omega$, say, and with certain types of
connections between $\mathfrak{X}$ and $\Omega$. $\mathfrak{X}$ is to be thought of as the set of possible outcomes
of an experiment, while $\Omega$ is the set of hypotheses about the experiment. Each
hypothesis is supposed to determine a complete ordering of the points of $\mathfrak{X}$ and,
dually, each outcome of the experiment determines a complete ordering of $\Omega$. The
extra-mathematical meaning of these orderings is that they give the ranking of the
points of $\mathfrak{X}$, respectively $\Omega$, according to how ‘likely’ or ‘plausible’ the points are
under the hypothesis, respectively outcome, in question. The supposition that the 
outcomes determine orderings amounts, in the words of Barnard (1949), ‘to 
assuming that a theory of inductive inference is possible’. Let the orderings 
induced by the hypotheses and the outcomes be called f-orderings and 
b-orderings, respectively (f and b stand for forward and backward).

The ideas of independent experiments and conjunctions of such experiments 
together with their respective hypothesis sets are reflected in the theory as a 
number of axioms which specify simple consistency relations between the 
f-orderings of experiments and their independent conjunctions as well as the 
exact same consistency relations for the b-orderings. 

Adding two dual (Archimedean) axioms for the orderings makes it possible to
show that to each pair $\mathfrak{X}$, $\Omega$ there exist real-valued functions $f(x; \omega)$ and $b(\omega; x)$ of
$x \in \mathfrak{X}$, $\omega \in \Omega$ such that the f-ordering determined by any $\omega$ is the same as the
ordering of $\mathfrak{X}$ induced by $f(\cdot; \omega)$, and similarly for b. The function $f(\cdot; \omega)$ on $\mathfrak{X}$ is
termed an f-lods function while $b(\cdot; x)$ is termed a b-lods function. Any such
function is a lods function. Furthermore, f- and b-lods functions for the
conjunction of a number of independent experiments are obtained by addition of
the f- and b-lods functions of the component experiments.

Finally, an assumption is introduced which relates the two kinds of orderings. 
In effect (see Barnard 1949) it is equivalent to the requirement that there exist 
functions d on $\mathfrak{X}$ and h on $\Omega$ such that

    $f(x; \omega) + h(\omega) = b(\omega; x) + d(x)$.

This crucial relation, called the inversion formula, is the bridge which indicates 
how evidential rankings of the hypotheses are induced by the observed outcome x 
and by the family of f-functions given by the hypotheses. 

The theory is exemplified, of course, by taking a family

    $\mathfrak{P} = \{p(\cdot; \omega): \omega \in \Omega\}$

of probability functions on $\mathfrak{X}$ and setting

    $f(x; \omega) = \ln p(x; \omega) = b(\omega; x)$

whence

    $h(\omega) = d(x) = 0$.

The b-lods functions are then the log-likelihood functions determined by $\mathfrak{P}$.
Indeed, excepting situations where $\omega$ is considered as having a prior distribution,
it may fairly be said that up till recently this was the only real example, although
the possibility of choosing $h(\omega)$ different from 0, with the accompanying change of
$b(\omega; x)$, was touched upon in Barnard (1949, 1972a). However, another example is
now furnished by plausibility inference through the choices

    $f(x; \omega) = \ln p(x; \omega)$,

    $b(\omega; x) = \ln \{p(x; \omega)/\sup_{x} p(x; \omega)\}$,

and

    $h(\omega) = -\ln \sup_{x} p(x; \omega)$,   $d(x) = 0$.
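
That the plausibility choices satisfy the inversion formula f + h = b + d can be confirmed numerically. The sketch below is only an illustration: it assumes a binomial model with trial number 5, and the function names are hypothetical.

```python
from math import comb, log

def pmf(x, p, n=5):
    # Binomial point probability, used here as the model function.
    return comb(n, x) * p**x * (1 - p)**(n - x)

def inversion_gap(x, p, n=5):
    # (f + h) - (b + d) for the plausibility choices of f, b, h and d.
    sup_x = max(pmf(z, p, n) for z in range(n + 1))
    f = log(pmf(x, p, n))
    b = log(pmf(x, p, n) / sup_x)  # log-plausibility
    h = -log(sup_x)
    d = 0.0
    return (f + h) - (b + d)

gaps = [inversion_gap(x, p) for x in range(6) for p in (0.1, 0.35, 0.8)]
```

Every entry of `gaps` vanishes up to rounding error, whatever the observation and parameter value.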

Two lods functions are considered as equivalent if they are equal up to an 
additive constant. This accords with the standard practice of disregarding factors 
of a likelihood function which depend on the observations only. For purposes of 
comparison, etc. of non-equivalent lods functions it is convenient to introduce a 
norming of lods functions, i.e. to select a representative from each equivalence 
class in some suitable way. Following the usual manner in which likelihood 
functions are normed, a lods function will be said to be normed if its supremum 
equals 0. 
If f is an f-lods function and b is a b-lods function then F = exp f is called an
f-ods function and B = exp b is called a b-ods function. The ods functions F and B are
normed if they have supremum 1.

When a given family of f-lods functions is to be inverted into a family of b-lods
functions, two elemental procedures, which do not require introduction of
acceptability functions that would have to be motivated by further principles, are
possible: direct inversion, or norming and then direct inversion (where by direct
inversion is meant an application of the inversion formula with $h(\omega) = d(x) = 0$).
Taking $\{\ln p(\cdot; \omega): \omega \in \Omega\}$ as the family of f-lods functions, these two procedures
yield the families of, respectively, log-likelihood and log-plausibility functions.

Let $f(x; \omega)$ be taken as $\ln p(x; \omega)$, let $h(\omega)$ and $d(x)$ be arbitrary, and suppose that
different values of $\omega$ correspond to different probability measures on $\mathfrak{X}$. Then
different values of $\omega$ do also determine different and non-equivalent f-lods
functions but, in general, different x values may give equivalent b-lods functions.
Here then is a lack of duality which in itself directs the attention to those cases
where no two members of the family of b-lods functions are equivalent. This latter
condition means that x is minimal sufficient. To see this, consider the partition of $\mathfrak{X}$
generated by the family of b-lods functions, i.e. the partition for which two points
x' and x'' of $\mathfrak{X}$ belong to the same element of the partition if and only if $b(\cdot; x')$ is
equivalent to $b(\cdot; x'')$. This partition is the same whatever the choice of the
functions $h(\omega)$ and $d(x)$, and hence equals the likelihood function partition of $\mathfrak{X}$
which is minimal sufficient (cf. Section 4.2).

Given a family $\{b(\cdot; x): x \in \mathfrak{X}\}$ of b-lods functions let us define the maximum
b-lods estimate of $\omega$ based on the observed outcome x as the set

    $\hat{\omega} = \{\omega: b(\omega; x) = \sup_{\omega} b(\omega; x)\}$.


It is clear that this procedure, which encompasses maximum likelihood and 
maximum plausibility estimation, has the property that the operations of 
estimation and reparametrization are interchangeable. 

One tends to think of lods functions — and in particular of probability, 
likelihood, or plausibility functions, or their logarithms — as having characteristic 
locations (or positions) and shapes. In many concrete situations this makes good 
sense, and is useful in summarily describing such a function and in classifying, 
sometimes only in a rather rough sense, the members of a family of such functions 
into subfamilies. 

It is often natural to indicate the position of the function by specifying that or 
those arguments for which the function takes its maximum. Lods functions which 
are quasi-concave or concave have a simple shape, and use of the maximum 
points as location indicators is particularly natural with such functions. Quasi- 
concave and especially concave log-probability, log-likelihood, and log- 
plausibility functions occur frequently in statistics and will play a rather 
prominent role in the present treatise.

A lods function of the form

(1)    $\omega \cdot x - h(\omega) - d(x)$,

where $\omega \in \Omega \subset R^{k}$, $x \in \mathfrak{X} \subset R^{k}$, will be designated as linear. The log-probability,
log-likelihood, and log-plausibility functions of an exponential model

    $p(x; \omega) = a(\omega) b(x) e^{\omega \cdot x}$

are all of this type.

Clearly, if a b-lods function is linear and given by (1) then, under smoothness
assumptions, the maximum b-lods estimate of $\omega$ is determined by the equation

    $Dh(\omega) = x$.

Furthermore, concavity of a linear f- or b-lods function is equivalent to convexity
of, respectively, d and h.
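
As an illustration of the estimating equation Dh = x, consider the Poisson model in its canonical parametrization, for which the log-likelihood is linear with h equal to the exponential function and d(x) = ln x!. This choice of model, the grid search, and the function names are assumptions made for the sketch; the equation then gives the estimate ln x.

```python
from math import exp, lgamma, log

def b_lods(omega, x):
    # omega * x - h(omega) - d(x) with h(omega) = exp(omega), d(x) = ln x!
    return omega * x - exp(omega) - lgamma(x + 1)

def grid_argmax(x, lo=-3.0, hi=4.0, steps=70000):
    # Crude maximization of the b-lods function over an equidistant grid.
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda w: b_lods(w, x))

x = 6
omega_hat = grid_argmax(x)
solution_of_Dh = log(x)  # exp(omega) = x
```

The grid maximizer agrees with the solution of Dh = x to within the grid spacing.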


3.2 PREDICTION FUNCTIONS 

Let x denote the outcome of a performed experiment and y the outcome of a
contemplated, independent experiment, which may be of a different type, and
suppose the two experiments relate to one and the same set of hypotheses,
indexed by the parameter $\omega$. The domains of variation of x, y, and $\omega$ are denoted
by $\mathfrak{X}$, $\mathfrak{Y}$, and $\Omega$. A prediction function for y based on x is a non-negative function
$C(\cdot \mid x)$ on $\mathfrak{Y}$ which is interpreted as expressing how credible (or likely or plausible)
the various possible outcomes of the contemplated experiment are relative to
each other, in the light of the observation x. Thus the interpretation of prediction
functions is similar to that of f-ods or b-ods functions.

Suppose that $B(\omega; x)$ is a b-ods function for $\omega$ based on x and that $F(y; \omega)$ is an
f-ods function for y based on $\omega$. The product $B(\omega; x)\bar{F}(y; \omega)$, where $\bar{F}$ is the
normed version of F, may be viewed as the joint credibility of $\omega$ and y, and it is
thus an immediate idea to consider

    $C(y \mid x) = \sup_{\omega} B(\omega; x)\bar{F}(y; \omega)$

as a prediction function for y.

Now, let $p(x; \omega)$ be a model function on $\mathfrak{X} \times \Omega$ and $p(y; \omega)$ a model function on
$\mathfrak{Y} \times \Omega$, and take $F(y; \omega)$ to be equal to $p(y; \omega)$. Then

    $\bar{F}(y; \omega) = \Pi(\omega; y)$,

and choosing $B(\omega; x) = L(\omega; x)$ one obtains the prediction function

(1)    $L(y \mid x) = \sup_{\omega} L(\omega; x)\Pi(\omega; y)$,

which will be called the likelihood prediction function. Similarly, with the choice
$B(\omega; x) = \Pi(\omega; x)$ the prediction function is denoted by $\Pi$ and is called the
plausibility prediction function, and one has

(2)    $\Pi(y \mid x) = \sup_{\omega} \Pi(\omega; x)\Pi(\omega; y)$.

The set of points $\omega$ for which the supremum on the right hand side of (1) is
attained will be denoted by $\hat{\omega}$ or $\hat{\omega}(y \mid x)$, and the maximum likelihood predictor is
the function $\hat{y}$ on $\mathfrak{X}$ such that $\hat{y}(x)$ is the set of points y which maximizes $L(\cdot \mid x)$. The
symbols $\check{\omega}(y \mid x)$ and $\check{y}$ (the maximum plausibility predictor) are defined
similarly in the plausibility case, i.e. (2).

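Example 3.1. Suppose x and y are binomially distributed for trial numbers m and
n, and with common probability parameter $\pi \in (0, 1)$.

The likelihood prediction function may be written

(3)    $L(y \mid x) = \frac{\binom{n}{y}}{\binom{n}{\tilde{y}}} \cdot \frac{\tilde{x}^{\tilde{x}}(m - \tilde{x})^{m - \tilde{x}}}{x^{x}(m - x)^{m - x}}$

where $\tilde{x}$ is a mean value and $\tilde{y}$ a mode point, corresponding to a common $\pi$ and
determined so that $\tilde{x} + \tilde{y} = x + y$; this is provided such $\pi$, $\tilde{x}$, and $\tilde{y}$ exist. (If they
do not exist the expression (3) still holds good in a generalized sense.)
Furthermore, the plausibility prediction function is

(4)    $\Pi(y \mid x) = \binom{m}{x}\binom{n}{y} \Big/ \left\{\binom{m}{\check{x}}\binom{n}{\check{y}}\right\}$

where $\check{x}$ and $\check{y}$ are mode points corresponding to a common $\pi$ and such that
$\check{x} + \check{y} = x + y$. Formulas (3) and (4) follow simply from general results given in
Section 9.7.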

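For n = 1 one finds

    $$\frac{L(1 \mid x)}{L(0 \mid x)} = \begin{cases}
        \{(x + 1)^{x+1}(m - x - 1)^{m-x-1}\}/\{x^{x}(m - x)^{m-x}\} & \text{if } x \le \tfrac{1}{2}m - 1 \\
        (m/2)^{m}/\{x^{x}(m - x)^{m-x}\} & \text{if } x = \tfrac{1}{2}m - \tfrac{1}{2} \\
        1 & \text{if } x = \tfrac{1}{2}m \\
        (m/2)^{-m}\{(x - 1)^{x-1}(m - x + 1)^{m-x+1}\} & \text{if } x = \tfrac{1}{2}m + \tfrac{1}{2} \\
        \{x^{x}(m - x)^{m-x}\}/\{(x - 1)^{x-1}(m - x + 1)^{m-x+1}\} & \text{if } x \ge \tfrac{1}{2}m + 1
    \end{cases}$$

and

    $$\frac{\Pi(1 \mid x)}{\Pi(0 \mid x)} = \begin{cases}
        (x + 1)/(m - x) & \text{for } x < [\tfrac{1}{2}(m + 1)\} \\
        1 & \text{for } x \in [\tfrac{1}{2}(m + 1)\} \\
        x/(m - x + 1) & \text{for } x > [\tfrac{1}{2}(m + 1)\}
    \end{cases}$$

►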
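
The plausibility ratio for n = 1 can be compared with a direct evaluation of the supremum that defines the plausibility prediction function. The following sketch makes several assumptions that are not in the text: the supremum over the parameter is approximated on an equidistant grid, the observed experiment has m = 6 trials, the three cases of the closed form are expressed through the inequalities 2x < m - 1 and 2x > m + 1, and the function names are hypothetical.

```python
from math import comb

def pmf(x, p, n):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def plaus(x, p, n):
    # Plausibility of p given x: p(x; p) over the largest point probability.
    return pmf(x, p, n) / max(pmf(z, p, n) for z in range(n + 1))

GRID = [i / 2000 for i in range(1, 2000)]

def plaus_prediction(y, x, m, n=1):
    # Grid approximation to sup over p of Pi(p; x) * Pi(p; y).
    return max(plaus(x, p, m) * plaus(y, p, n) for p in GRID)

def closed_form_ratio(x, m):
    if 2 * x < m - 1:
        return (x + 1) / (m - x)
    if 2 * x > m + 1:
        return x / (m - x + 1)
    return 1.0

m = 6
ratios = [plaus_prediction(1, x, m) / plaus_prediction(0, x, m)
          for x in range(m + 1)]
expected = [closed_form_ratio(x, m) for x in range(m + 1)]
```

Because each supremum is attained on a whole interval of parameter values, the grid evaluation reproduces the closed form up to rounding error.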
Suppose that the maximum likelihood estimate $\hat{\omega} = \hat{\omega}(x)$ of $\omega$ based on x exists
and that the probability function $p(y; \omega)$ has a mode $\check{y}(\omega)$ whatever the value of $\omega$
in $\Omega$. Then, since plausibilities are less than or equal to 1,

    $L(y \mid x) = \sup_{\omega} L(\omega; x)\Pi(\omega; y)$
    $\le \sup_{\omega} L(\omega; x)$
    $= L(\hat{\omega}; x)$
    $= L(\hat{\omega}; x)\Pi(\hat{\omega}; \check{y}(\hat{\omega}))$

and hence

(5)    $\check{y}(\hat{\omega}(x)) \subset \hat{y}(x)$.

The same kind of argument shows that

(6)    $\check{y}(\check{\omega}(x)) \subset \check{y}(x)$.

It is obvious that when $\hat{\omega}(x)$ and $\check{\omega}(x)$ exist the inclusions in (5) and (6) are, in fact,
equalities except in rather pathological instances, i.e. as a rule

    $\hat{y}(x) = \check{y}(\hat{\omega}(x))$

and

    $\check{y}(x) = \check{y}(\check{\omega}(x))$.

Thus these predictates are simply the sets of mode points corresponding to the
respective estimates of $\omega$ based on x.

Finally, it may be noted that if the outcome y is to be predicted but without x 
having been observed then it is natural to define the likelihood and plausibility 
prediction functions as 


    $L(y) = \Pi(y) = \sup_{\omega} \Pi(\omega; y)$.

In the case when the family of probability functions for y is universal, $L(\cdot)$ and $\Pi(\cdot)$
are both constant (and equal to 1).

Suppose, for example, that a coin is to be thrown n times and no information is
available about the probability $\pi$ that it turns up heads in any single throw, and
that the aim is to predict the number of heads y. Since the distribution of y is
universal, any of the possible values of y are held equally credible by likelihood 
prediction as well as by plausibility prediction. The same conclusion is true if the 
outcome y to be predicted is the actual sequence of heads and tails. 
3.3 INDEPENDENCE

Let $w_1, \dots, w_m$ be a collection of variables and let $M_1, \dots, M_m$ denote their
domains of variation. If the domain of variation M, say, of the combined variable
$w = (w_1, \dots, w_m)$ is equal to the product $M_1 \times \dots \times M_m$ of the single domains
then $w_1, \dots, w_m$ are called variation independent. Furthermore, let f be a
non-negative function defined on M. Then $w_1, \dots, w_m$ are said to be independent under f
provided

(i) $w_1, \dots, w_m$ are variation independent.

(ii) There exist (non-negative) functions $f_1, \dots, f_m$ defined, respectively, on
$M_1, \dots, M_m$ such that $f(w) = f_1(w_1) \cdots f_m(w_m)$, $w = (w_1, \dots, w_m) \in M$.

The definition of independence under a function f covers the classical definition
of stochastic independence of random variables in terms of densities, as well as the
definition of independence of $\sigma$-algebras under a probability measure. Dual to the
former is the following definition of likelihood independence, or, for short,
L-independence. Let $p(x; \omega)$ be a model function and let $(\omega^{(1)}, \dots, \omega^{(m)})$ be a partition
of the parameter vector $\omega$. Then $\omega^{(1)}, \dots, \omega^{(m)}$ are said to be L-independent at x ($\in \mathfrak{X}$)
provided they are independent under $p(x; \cdot)$ and, furthermore, $\omega^{(1)}, \dots, \omega^{(m)}$ are
called L-independent if they are L-independent at x for every $x \in \mathfrak{X}$. (Obviously, one
might also introduce a concept of plausibility independence or $\Pi$-independence.
However, $\Pi$-independence seems to be so rare a property as to be of no real
interest.)

The remainder of this section consists of some comments on the use of 
L-independence, and of a number of examples of this property. 

Example 3.2. Let $\mathfrak{P}$ be the family of k-dimensional multinomial distributions with
fixed trial parameter n. The model function is

(1)    $p(x; \pi) = \frac{n!}{x_1! \cdots x_k!(n - x_1 - \dots - x_k)!} \pi_1^{x_1} \cdots \pi_k^{x_k}(1 - \pi_1 - \dots - \pi_k)^{n - x_1 - \dots - x_k}$

and the domain of variation of the parameter $\pi = (\pi_1, \dots, \pi_k)$ is the k-dimensional
simplex

(2)    $\Pi = \{\pi: \pi_1 > 0, \dots, \pi_k > 0, \pi_1 + \dots + \pi_k < 1\}$.

The set of equations

    $\omega_1 = \pi_1$,   $\omega_2 = \frac{\pi_2}{1 - \pi_1}$,   $\dots$,   $\omega_k = \frac{\pi_k}{1 - \pi_1 - \dots - \pi_{k-1}}$

determines a reparametrization of $\mathfrak{P}$ and $\omega = (\omega_1, \dots, \omega_k)$ has domain of
variation

    $\Omega = (0, 1)^{k}$.

Moreover, (1) can be written

    $b(x_1; n, \omega_1)\, b(x_2; n - x_1, \omega_2) \cdots b(x_k; n - x_1 - \dots - x_{k-1}, \omega_k)$

where $b(z; \nu, \alpha)$ denotes the point probability at z for the binomial distribution
with trial parameter $\nu$ and probability parameter $\alpha$.

Thus $\omega_1, \dots, \omega_k$ are L-independent. ►
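
The factorization into binomial terms is easily verified numerically. The sketch below uses hypothetical function names and arbitrarily chosen values of x, n and the probability vector; it evaluates the multinomial point probability directly and as the product of successive conditional binomial probabilities with the reparametrized probabilities.

```python
from math import comb, factorial, isclose

def binom(z, v, a):
    # Binomial point probability b(z; v, a).
    return comb(v, z) * a**z * (1 - a)**(v - z)

def multinomial(x, n, pi):
    # Model function (1): k explicit cells plus the remainder cell.
    rest = n - sum(x)
    coef = factorial(n) // factorial(rest)
    for xi in x:
        coef //= factorial(xi)
    prob = coef * (1 - sum(pi))**rest
    for xi, p in zip(x, pi):
        prob *= p**xi
    return prob

def factorized(x, n, pi):
    # Product of b(x_i; n - x_1 - ... - x_{i-1}, omega_i).
    prob, used_n, used_pi = 1.0, 0, 0.0
    for xi, p in zip(x, pi):
        omega = p / (1 - used_pi)
        prob *= binom(xi, n - used_n, omega)
        used_n += xi
        used_pi += p
    return prob

x, n, pi = (2, 1, 3), 9, (0.2, 0.1, 0.4)
direct, product_form = multinomial(x, n, pi), factorized(x, n, pi)
```

The two evaluations agree up to rounding error, and each factor of the product involves only one of the new parameters.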

A main consequence of the property of L-independence is that it simplifies the 
handling and visualization of a likelihood function. In particular, the problem of 
finding the maximum likelihood estimate of $\omega$ falls into m separate pieces if
$\omega^{(1)}, \dots, \omega^{(m)}$ are L-independent.

Note also that if prior likelihood in the sense of Edwards (1969) is available and
if the components $\omega^{(1)}, \dots, \omega^{(m)}$ are likelihood independent both under the prior
likelihood and under the likelihood provided by the actual data then they are also
independent a posteriori. Also, if $\omega^{(1)}, \dots, \omega^{(m)}$ are L-independent at x and if they
follow an a priori probability distribution under which they are (stochastically)
independent, then they are independent too under the a posteriori distribution.

A special, close connection between certain cases of stochastic independence 
and L-independence, in exponential families, will be discussed in Section 9.2. 

Besides the examples of L-independence mentioned here some instances may 
be found in the discussions of cuts and S-ancillarity to be given in Sections 4.4, 
10.2, and 10.3. 

Example 3.3. Other types of factorization of the likelihood function for the
multinomial family than that mentioned in Example 3.2 are possible. For
instance, when k = 3 the expression (1) can be recast as

    $b(x_1 + x_2; n, \alpha)\, b(x_1; x_1 + x_2, \beta)\, b(x_3; n - x_1 - x_2, \gamma)$

where $\alpha = \pi_1 + \pi_2$, $\beta = \pi_1/(\pi_1 + \pi_2)$, $\gamma = \pi_3/(1 - \pi_1 - \pi_2)$, and $(\alpha, \beta, \gamma)$ varies in
$(0, 1)^{3}$. ►

Example 3.4. Let $x_1$ and $x_2$ be independent and Poisson distributed with mean
values $\lambda_1$ and $\lambda_2$. The distribution of $x_{\cdot}$ is Poisson with mean value $\lambda_{\cdot}$, while
conditionally on $x_{\cdot}$ the variate $x_1$ follows the binomial distribution having trial
number $x_{\cdot}$ and probability parameter $\chi = \lambda_1/\lambda_{\cdot}$. Thus $\lambda_{\cdot}$ and $\chi$ are L-independent
provided they are variation independent, which is the case not only if $\lambda = (\lambda_1, \lambda_2)$
varies freely in $(0, \infty)^{2}$ but also if, for instance, $\lambda_2$ is known to be less than or equal to
$\lambda_1$, since then $(\lambda_{\cdot}, \chi)$ has domain of variation $(0, \infty) \times [\tfrac{1}{2}, 1)$.

A concrete example of an experiment for which the latter model with $\lambda_2 \le \lambda_1$
could well be appropriate is that of readings of a Geiger counter, without and with
a piece of material inserted between the counter and the radioactive source. ►
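
The splitting of the joint Poisson model into a marginal Poisson factor for the sum and a conditional binomial factor may be checked directly. In the sketch below the function names and the numerical values are illustrative assumptions; the first factor depends on the sum of the two means only and the second on the ratio of the first mean to that sum only.

```python
from math import comb, exp, factorial, isclose

def poisson(x, lam):
    return exp(-lam) * lam**x / factorial(x)

def binom(z, v, a):
    return comb(v, z) * a**z * (1 - a)**(v - z)

def joint(x1, x2, lam1, lam2):
    # Joint point probability of two independent Poisson variates.
    return poisson(x1, lam1) * poisson(x2, lam2)

def split(x1, x2, lam1, lam2):
    # Marginal Poisson factor for the sum times conditional binomial factor.
    lam_dot = lam1 + lam2
    chi = lam1 / lam_dot
    return poisson(x1 + x2, lam_dot) * binom(x1, x1 + x2, chi)

cases = [(0, 0, 1.5, 0.5), (3, 1, 2.0, 1.2), (2, 5, 4.0, 3.9)]
agree = all(isclose(joint(*c), split(*c), rel_tol=1e-9) for c in cases)
```

Agreement of the two expressions for every observation is what makes the pair consisting of the total mean and the ratio L-independent whenever the two vary independently.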
Example 3.5. Consider a finite collection $\{x_i: i \in I\}$ of independent Poisson
variates, let $I_1, \dots, I_r$ be the elements of a partition of I and set

    $x^{(k)} = \sum_{i \in I_k} x_i$,   $k = 1, 2, \dots, r$.

From Examples 3.2, 3.3, and 3.4, one sees that the joint distribution of the $x_i$ may
be split into the marginal (Poisson) distribution of $x_{\cdot}$, the conditional
(multinomial) distribution of $x^{(*)} = (x^{(1)}, \dots, x^{(r)})$ given $x_{\cdot}$, and the product over k of the
conditional (multinomial) distribution of $\{x_i: i \in I_k\}$ given $x^{(k)}$, and that corresponding to this
factorization one has L-independent subparameters.

The unconditional model for $\{x_i: i \in I\}$, the conditional model given $x_{\cdot}$, and the
conditional model given $x^{(*)}$ are, of course, the three base models for contingency
table analysis. ►


Example 3.6. The negative multinomial distribution of order k and with shape
parameter $\chi$ and probability parameter $\pi = (\pi_1, \dots, \pi_k)$ is the distribution on $N_0^{k}$,
where $N_0 = \{0, 1, \dots\}$, having point probabilities

    $\frac{\Gamma(\chi + x_{\cdot})}{\Gamma(x_1 + 1) \cdots \Gamma(x_k + 1)\Gamma(\chi)} \pi_1^{x_1} \cdots \pi_k^{x_k}(1 - \pi_{\cdot})^{\chi}$.

The domain of $(\chi, \pi)$ is $(0, \infty) \times \Pi$ where $\Pi$ is given by (2).

Several types of factorizations are possible here; consider, for instance, the case
k = 2. Let $b^{-}(\cdot; \cdot, \cdot)$ denote the model function for the negative binomial
distribution. Conditioning on $x_1$ yields the factorization

    $b^{-}(x_1; \chi, \rho)\, b^{-}(x_2; \chi + x_1, \pi_2)$

where $\rho = \pi_1/(1 - \pi_2)$. And conditioning on $x_{\cdot}$ gives

    $b^{-}(x_{\cdot}; \chi, \pi_{\cdot})\, b(x_1; x_{\cdot}, \alpha)$

where $\alpha = \pi_1/\pi_{\cdot}$. ►


Example 3.7. Suppose $x_1, \dots, x_n$ is a random sample from the r-dimensional
normal distribution $N_r(\xi, \Sigma)$ and let $\mathfrak{P}$ be the class of probability measures of
$x = (x_1, \dots, x_n)$ for $(\xi, \Sigma)$ varying freely, $\Sigma$ being nonsingular. Split $\xi$ into two
components, $\xi^{(1)}$ and $\xi^{(2)}$, of dimensions q and r - q, respectively, and partition
$\Sigma$ correspondingly. Then $\omega^{(1)} = (\xi^{(1)}, \Sigma_{11})$ and

    $\omega^{(2)} = (\xi^{(2)} - \Sigma_{21}\Sigma_{11}^{-1}\xi^{(1)},\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12},\ \Sigma_{11}^{-1}\Sigma_{12})$

are L-independent, i.e. the parameters of respectively the marginal distribution of
$x^{(1)}$ and the conditional distribution of $x^{(2)}$ given $x^{(1)}$ are L-independent. (The
variation independence of $\omega^{(1)}$ and $\omega^{(2)}$ is not difficult to verify directly, but
may also be obtained as an immediate consequence of Theorem 9.3; see
Example 9.5.) ►
Example 3.8. In a medical study, the l persons under observation entered the
study at individual times $t_1 < \dots < t_l$ but were thence continuously monitored till
a time T (> $t_l$). All the persons were fit at the time of entrance but might, during
the observation period, interchangeably be in this state and a state of disabledness,
and might pass from each of these states to that of death. Supposing that the
time-state records for the various persons are a set of l independent observations of
a Markov process with transition intensities between states as indicated in
the diagram

    fit       --σ-->  disabled
    disabled  --ρ-->  fit
    fit       --μ-->  dead
    disabled  --ν-->  dead

then the likelihood function becomes

(3)    $\mu^{m}\sigma^{s}e^{-(\mu + \sigma)u}\,\nu^{n}\rho^{r}e^{-(\nu + \rho)w}$

where

    m = number of deaths among fit persons
    s = number of disablements
    n = number of deaths among disabled persons
    r = number of recoveries
    u = total time lived in fit state
    w = total time lived in disabled state.

This model was studied by Sverdrup (1965).

It follows from (3) that the parameters $\mu$, $\sigma$, $\nu$, and $\rho$ are L-independent, on the
assumption that $(\mu, \sigma, \nu, \rho)$ varies in $(0, \infty)^{4}$.

Moreover, the parameters $\mu + \sigma$, $\sigma/(\mu + \sigma)$, $\nu + \rho$, and $\rho/(\nu + \rho)$ are
L-independent.

Similar results hold for the birth and death process. Denote the birth and death
intensities by $\lambda$ and $\mu$. Then the likelihood function, based on continuous
observation during a time interval [0, T] of a population consisting initially of l
individuals, is given by

    $\lambda^{b}\mu^{d}e^{-(\lambda + \mu)z}$

where

    b = number of births
    d = number of deaths
    z = total time lived.

Hence $\lambda$ and $\mu$ are L-independent, and so are $\lambda + \mu$ and $\lambda/(\lambda + \mu)$.

Indeed, it is obvious from these cases that analogous conclusions hold
generally for Markov processes with continuous time and finite state space.
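
For the birth and death process the second pair of L-independent parameters can be exhibited explicitly. The sketch below assumes that the likelihood is proportional to the product of the birth intensity raised to the number of births, the death intensity raised to the number of deaths, and the exponential of minus the total intensity times the total time lived; the function names and numerical values are hypothetical.

```python
from math import exp, isclose

def likelihood(lam, mu, b, d, z):
    # Assumed form: lam^b * mu^d * exp(-(lam + mu) * z).
    return lam**b * mu**d * exp(-(lam + mu) * z)

def reparametrized(theta, phi, b, d, z):
    # One factor in theta = lam + mu, another in phi = lam / (lam + mu).
    return (theta**(b + d) * exp(-theta * z)) * (phi**b * (1 - phi)**d)

lam, mu, b, d, z = 0.7, 0.4, 5, 3, 12.0
direct = likelihood(lam, mu, b, d, z)
product_form = reparametrized(lam + mu, lam / (lam + mu), b, d, z)
```

Since the likelihood is a product of a function of the total intensity and a function of the ratio, and the two parameters vary independently, they are L-independent under the stated assumption.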


3.4 COMPLEMENTS 

(i) The shapes of the lods functions of a given family of such functions can 
sometimes be made simpler or more uniform through transformation to another 
argument variable. The normalizing and the variance-stabilizing transformations 
of probability functions which have long been of standard use in statistics are of 
this kind, and more recently analogous transformations have been introduced for 
(log-) likelihood functions (see Anscombe 1964a, Sprott 1973, and Box and Tiao 
1973). A treatment of these two particular types of transformations — normalizing 
and spread-stabilizing — will be given, for one-dimensional exponential models, in 
Section 9.8 (v). 

(ii) Approximate L-independence. Consider a model function $p(x; \omega)$ and a
partition $(\omega^{(1)}, \dots, \omega^{(m)})$ of $\omega$. In cases where $\omega^{(1)}, \dots, \omega^{(m)}$ are not L-independent
it may still be, of course, that the likelihood function $p(x; \cdot)$ factorizes either
exactly or approximately in a neighbourhood of the maximum likelihood 
estimate. Such a property is often helpful in the statistical analysis, both con- 
ceptually and with respect to the numerical and graphical handling of the 
likelihood function. 

The following kind of approximate factorization of p(x; •) is of particular 
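interest. Let l denote the log-likelihood function,

    $l(\cdot) = \ln p(x; \cdot)$.

The components $\omega^{(1)}, \dots, \omega^{(m)}$ are said to be infinitesimally L-independent at x
provided the likelihood function has a unique maximum point $\hat{\omega}$ which belongs
to the interior of $\Omega$ and provided

(1)    $\frac{\partial^{2} l}{\partial\omega^{(i)}\,\partial\omega^{(j)\prime}}(\hat{\omega}) = 0$,   $i \ne j$.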




Anscombe (1961, 1964a,b) has discussed the possibilities of obtaining 
infinitesimal L-independence, as well as approximate normal shape, of the 
likelihood function by transformation of the parameters (see Sections 9.8(v) and 
(vi)). 

Infinitesimal L-independence is closely related to the concept of orthogonality 
of parameters introduced by Jeffreys (1948). 

The parameters $\omega^{(1)}, \dots, \omega^{(m)}$ are said to be orthogonal under $\omega$ if

(2)    $E_{\omega}\left\{\frac{\partial^{2} l}{\partial\omega^{(i)}\,\partial\omega^{(j)\prime}}(\omega)\right\} = 0$,   $i \ne j$.

Under standard regularity assumptions this is equivalent to asymptotic stochastic
independence of the components $\hat{\omega}^{(1)}, \dots, \hat{\omega}^{(m)}$ of the maximum likelihood
estimate $\hat{\omega}$ in the asymptotic normal distribution of $\hat{\omega}$.

If the dimension of $\omega$ is k then the problem of finding a transformation of $\omega$
such that the k one-dimensional components of the new parameter are orthogonal
amounts to solving $\binom{k}{2}$ partial differential equations in k unknown
functions. As mentioned by Huzurbazar (1950), it is therefore to be expected that 
when k> 3 the problem will be solvable only for very special families of 
distributions. For k = 3 and especially for k = 2 there is better hope of solving the 
equations. Huzurbazar (1950, 1956) indicated a way of doing this and applied it to 
a number of two-parameter distributions. 

Clearly L-independence implies both infinitesimal L-independence and ortho- 
gonality. Note also that orthogonality in many cases implies approximate 
infinitesimal L-independence and vice versa, because for large sample sizes the left 
hand side of (1) tends to be near the left hand side of (2). For exponential families 
the notions of orthogonality and infinitesimal L-independence are even more 
closely related (see Section 9.8(vi)). 

3.5 NOTES 

The method of constructing prediction functions described in Section 3.2 was first 
proposed in Barndorff-Nielsen (1976b) and the two particular cases of plausibility 
and likelihood prediction have been briefly discussed there and in BarndorfT 
Nielsen (1977b). The general method was prompted by (4) of Section 3.2, which 
was originally derived as a formal plausibility analogue of Fisher’s (1956) (likeli- 
hood based) prediction function for binomial experiments. More specifically, the 
two latter prediction functions are equal to, respectively, the plausibility ratio 
and the likelihood ratio test statistics for the hypothesis that the probability para- 
meter is the same in the two experiments. The reasoning leading to likelihood and 
plausibility prediction functions, which is given in Section 3.3, is of a different 
kind, and it appears that Fisher’s prediction function is not derivable as a special 
case of the general construction discussed in that section. For binomial 
experiments, both the likelihood and the plausibility prediction method of 
Section 3.2 have the property that if the sample size m of the observed experiment 
tends to infinity and the probability parameter π thus becomes known then in the
limit the prediction function is proportional to the probability function for the 
unobserved experiment; this property is not shared by Fisher’s method. 
Mathiasen (1977) has given a detailed study and comparison of exact and 
asymptotic properties of four particular prediction procedures, including 
plausibility prediction and the generalization of Fisher’s proposal for binomial 
experiments. References to other approaches to prediction may be found in 
Mathiasen’s paper. 




CHAPTER 4 


Logic of Inferential Separation. 
Ancillarity and Sufficiency 


The operations of margining to a sufficient statistic and conditioning on an 
ancillary statistic are the primary procedures leading to separate inference, i.e. 
inference on a parameter of interest based on only a part of the original model and 
data. The logic relating to inferential separation and in particular to the general, 
intuitive notions of ancillarity and sufficiency is discussed in Section 4.1, and then 
various, mathematically defined concepts of ancillarity and sufficiency are 
studied in Sections 4.2, 4.4, and 4.5. A key problem in this connection is that of 
giving precise meaning to the notion that a certain part of the model and data 
does not contain any information with respect to the parameter of interest, and 
definitions of this notion, which is called nonformation, are given in Section 4.3. 
Finally, Section 4.6 contains some results on the relations between conditional 
and unconditional plausibility functions. 


4.1 ON INFERENTIAL SEPARATION. ANCILLARITY 
AND SUFFICIENCY 

Let u and v be statistics, let ψ be a (sub)parameter with variation domain Ψ and
suppose that the conditional distribution of u given v depends on ω through ψ
only, and is in fact parametrized by ψ. Making inference on ψ from u and from the
conditional model for u given the observed value of v is an act of separate
inference. In such a connection it is customary to speak of ψ as the parameter of
interest. That part of ω which is complementary to ψ will be termed incidental.
(The more commonly used, but somewhat emotional, term nuisance for this 
complementary part is avoided here.) 

It may or may not be the case that, in the given context, that part of the data x
and the model 𝔓 which is complementary to the above-mentioned model-data
basis for the separate inference can be considered as irrelevant or containing no
available information with respect to ψ. If it is the case then the complementary
part will be called nonformative with respect to ψ.

On the general principle that if something is irrelevant for a given problem then 
one should effect (if possible) that it does not influence the solution of the 
problem, one is led to conclude that it is proper to draw the inference on ψ
separately provided that, as discussed above, the complementary part is
nonformative. This conclusion, which may be called the principle of nonformation,
specializes to the principle of ancillarity by taking u = x (the identity mapping on
𝔛) and to the principle of sufficiency by taking v constant. A detailed discussion of
these two latter principles is given in the following.

Principle of Ancillarity. Suppose that, for some statistic t and some
parameter ψ,

(i) The conditional model for x given the statistic t is parametrized by ψ.

(ii) The conjunction (𝔓_t, t) of the marginal model for t and the observed value
of t is nonformative with respect to ψ.

Then inference on ψ should be performed from the conjunction (𝔓(·|t), x). ►

When (i) and (ii) above are satisfied t is said to be ancillary with respect to ψ.

In comprehending the ancillarity principle it is essential to think of the
experiment, E say, with outcome x as a mixture experiment, i.e. as being composed
of two consecutive experiments, the first, E_t, having outcome t and the second, E^t,
corresponding to the sample space 𝔛_t = {x: t(x) = t} and the distribution family
𝔓(·|t). Under the conditions set out in the principle, the first experiment does not
in itself give any information on the value of ψ and the second experiment yields
information on the value of ψ only and thus does not provide possibility for, as it
were, drawing inference within the subfamilies 𝔓_ψ = {P: ψ(P) = ψ}. One
may judge, therefore, that if it is known that the experiment E^t has been
performed and has led to the outcome x then the additional knowledge that the
value t has been determined by the experiment E_t is irrelevant as far as inference
on ψ is concerned. The direct way to ensure that this irrelevant knowledge does
not influence the inference concerning ψ is to base the inference on (𝔓(·|t), x), as
the principle prescribes.

In many cases the precision of the inference on ij/ which the (conditional) 
experiment allows varies in a simple systematic way with the value t of the 
ancillary statistic, so that this value has an immediate interpretation as an index 
of precision. Thus t has a function analogous to that of the sample size. According 
to the principle of ancillarity, the appropriate precision to report is that indicated 
by t, not the precision relative to repetitions of the whole experiment E. These 
aspects were repeatedly stressed by Fisher (see, e.g., Fisher 1935, 1956). To 
illustrate them in the simplest possible, though somewhat unrealistic, setting, 
suppose an experimenter chooses between two measuring instruments by 
throwing a fair coin, it being known that these instruments yield normally 
distributed observations with mean value equal to the parameter of interest, μ
say, and standard deviations, respectively, 1 and 10. Whichever instrument is
selected, only one measurement will be taken. Let t = 0 or 1 indicate the outcome
of the coin throw and let u be the measurement result; thus x = (t, u). In a
hypothetical, infinite sequence of independent repetitions of the whole experiment
the estimate u of μ would follow the distribution with probability function

\[ \tfrac{1}{2}\,\varphi(u - \mu) + \tfrac{1}{20}\,\varphi\!\left(\frac{u - \mu}{10}\right) \tag{1} \]

where φ is the density of N(0, 1). However, the precision in the estimation of μ
provided by the one experiment actually performed is described not by (1) but by
either φ(u − μ) or φ[(u − μ)/10]/10, depending on which of the two instruments
was in fact used. In other words, the precision is given by the conditional model
given the observed value of the ancillary statistic t.
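
The contrast between the unconditional model (1) and the conditional model can be
checked numerically. The following Python sketch is an illustration only; the
function names and the simulation set-up are assumptions introduced here and are
not part of the original exposition. It compares the spread of the error in the
estimate over repetitions of the whole experiment with the spread for each
instrument separately.

```python
import math
import random

# Standard deviations of the two instruments, indexed by the coin outcome t.
SD = {0: 1.0, 1: 10.0}


def marginal_sd():
    """Standard deviation of u - mu over repetitions of the whole experiment.

    The unconditional distribution is an equal-weight mixture of two
    zero-mean normals, so its variance is the average of the two variances.
    """
    return math.sqrt(0.5 * SD[0] ** 2 + 0.5 * SD[1] ** 2)


def simulate_errors(mu, n_rep, seed=0):
    """Simulate n_rep repetitions; return the errors u - mu grouped by t."""
    rng = random.Random(seed)
    errors = {0: [], 1: []}
    for _ in range(n_rep):
        t = rng.randint(0, 1)       # fair coin selecting the instrument
        u = rng.gauss(mu, SD[t])    # single measurement with that instrument
        errors[t].append(u - mu)
    return errors


def rms(values):
    """Root mean square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


if __name__ == "__main__":
    errors = simulate_errors(mu=3.0, n_rep=20000)
    print("whole experiment:", rms(errors[0] + errors[1]), "theory:", marginal_sd())
    print("given t = 0     :", rms(errors[0]))
    print("given t = 1     :", rms(errors[1]))
```

The unconditional figure describes neither instrument: it is far too pessimistic
when t = 0 and too optimistic when t = 1, which is the point of reporting the
precision indicated by t.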


Principle of Sufficiency. Suppose that, for some statistic t and some parameter ψ,

(i) The marginal model for t is parametrized by ψ.

(ii) The conjunction (𝔓(·|t), x) is nonformative with respect to ψ.

Then inference on ψ should be performed from the conjunction (𝔓_t, t). ►

The statistic t is called sufficient with respect to ψ when both (i) and (ii) of the
sufficiency principle hold.

As with the ancillarity principle, it is important for understanding the primitive 
content of the principle of sufficiency to think of x as being obtained by a mixture, 
or two-stage, experiment, where here it is the knowledge about the second 
experiment which is irrelevant for inference on ψ.

In any given situation it is necessary to make precise what is meant by
nonformativeness before it can be decided whether the principle of nonformation
applies. Various mathematical definitions of nonformation are given in Section
4.3. The simplest of these is B-nonformation which, in particular, yields the
classical concepts of ancillarity and sufficiency. In the present book these classical
concepts are called B-ancillarity and B-sufficiency while the words ancillarity and
sufficiency are reserved for general indicative use. (A statistic t is B-ancillary if its
distribution is the same for all P ∈ 𝔓, and t is B-sufficient if the conditional
distribution of x given t is the same for all P ∈ 𝔓.) For practical expository reasons
the well-known theory of B-ancillary and B-sufficient statistics is dealt with
separately in the next section, whereas the ancillarity and sufficiency notions
which flow from the other specifications of nonformation presented in Section
4.3, i.e. S-, M- and G-nonformation, will be treated in Section 4.4. Some indications
of the roles of all these concepts are however given next through discussion of
some key examples.

The earliest concrete example of what is here called a B-ancillary statistic was
treated by Fisher (1934) in his first discussion of fiducial inference with respect to
the location and scale parameters α and β of a continuous type, one-dimensional
family of distributions, the density being

\[ \frac{1}{\beta}\, f\!\left(\frac{x - \alpha}{\beta}\right) \]

where f is supposed to be known. For a sample of n observations x_1, …, x_n the
statistic

\[ \left( \frac{x_1 - x_3}{x_1 - x_2},\ \frac{x_1 - x_4}{x_1 - x_2},\ \ldots,\ \frac{x_1 - x_n}{x_1 - x_2} \right) \]

which Fisher talked of as specifying the configuration or complexion of the
sample, is clearly B-ancillary.
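
The B-ancillarity of a configuration statistic rests on its invariance under changes
of location and scale. The short Python sketch below illustrates that invariance
for one sample. The particular ratios used, (x_1 − x_i)/(x_1 − x_2), and the
function names are assumptions made for the illustration; any statistic built
from ratios of differences of the observations behaves in the same way.

```python
def configuration(xs):
    """Ratios (x_1 - x_i) / (x_1 - x_2) for i = 3, ..., n (illustrative choice)."""
    x1, x2 = xs[0], xs[1]
    return [(x1 - x) / (x1 - x2) for x in xs[2:]]


def location_scale(xs, alpha, beta):
    """Apply the transformation x -> alpha + beta * x to every observation."""
    return [alpha + beta * x for x in xs]


if __name__ == "__main__":
    sample = [1.0, 2.5, -0.3, 4.0, 0.7]
    # The two printed lists agree up to rounding error.
    print(configuration(sample))
    print(configuration(location_scale(sample, alpha=3.0, beta=2.0)))
```

Since the sample from the density with parameters (α, β) is distributed as
α + β times a sample from f, invariance of the statistic means that its
distribution cannot depend on (α, β).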

Perhaps the most commonly occurring instance of a process of separate
inference is that of making a linear regression analysis from n pairs of
observations (x_1, y_1), …, (x_n, y_n) in the cases where the xs vary randomly but are
considered as given in the analysis. Here ψ = (α, β, σ²) where α and β are the
position and slope parameters of the regression line and σ² is the residual
variance. The separation logic in this situation is however seldom explicated (but
see Fisher 1956, §4.3, and Sverdrup 1966). Under the usual assumptions of
stochastic independence and identical distribution of (x_1, y_1), …, (x_n, y_n) (as well
as normality of the conditional distribution of y given x), the marginal model for
the xs together with the observation (x_1, …, x_n) is indeed nonformative with
respect to ψ in one of the specific senses, that of S-nonformation, given in Section
4.3, on the proviso that for every value of (α, β, σ²) the class of distributions of
(x_1, …, x_n) is the same. In other words, under the condition stipulated the statistic
(x_1, …, x_n) is S-ancillary with respect to (α, β, σ²). The condition is satisfied, in
particular, if the family of distributions of a single pair (x, y) is the family of
two-dimensional normal distributions.

The regression situation also gives an obvious illustration of the point made 
earlier about the function of ancillary statistics in providing the relevant 
indication of the precision in inference on the parameter of interest. 

One of the very first examples adduced by Fisher to illustrate his idea of
ancillarity, and which has been of decisive importance for the development of
separate inference, concerns the 2 x 2 table

    x_1    n_1 − x_1    n_1
    x_2    n_2 − x_2    n_2
    x.     n. − x.      n.

for two independent binomial variates, having probability parameters p_1 and p_2,
respectively (see Fisher 1935). Suppose the object is to test the hypothesis that the
odds ratio

\[ \psi = \frac{p_1}{1 - p_1} \Big/ \frac{p_2}{1 - p_2} \tag{2} \]

has a particular value, ψ_0 say, the main possibility being, of course, ψ_0 = 1 which
corresponds to p_1 = p_2. In relation hereto Fisher remarks:
Let us blot out the contents of the table, leaving only the marginal frequencies. If it be
admitted that these marginal frequencies by themselves supply no information on the
point at issue, ... we may ... recognize that we are concerned only with the relative
probabilities of occurrence of the different ways in which the table can be filled in,
subject to these marginal frequencies.

From a remark in Fisher’s reply in the discussion to his paper it is apparent that
he was considering the marginal frequencies, i.e. in effect x., as ancillary not just
with respect to the hypothesis ψ = ψ_0 but with respect to ψ.

The conditional model given x. is parametrized by ψ, and x. is M-ancillary with
respect to ψ.
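
That the conditional model given the column total depends on the two probability
parameters only through the odds ratio can be verified by direct computation. The
Python sketch below is a minimal illustration under assumptions introduced here
(function names, table sizes and parameter values are hypothetical). It computes
the conditional distribution of the first binomial count given the total from the
joint binomial probabilities, for two parameter pairs sharing the same odds ratio.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def conditional_pmf(n1, n2, total, p1, p2):
    """P(x_1 = k | x_1 + x_2 = total), computed from the joint binomial model."""
    lo, hi = max(0, total - n2), min(n1, total)
    joint = {k: binom_pmf(k, n1, p1) * binom_pmf(total - k, n2, p2)
             for k in range(lo, hi + 1)}
    norm = sum(joint.values())
    return {k: v / norm for k, v in joint.items()}


def p1_for_odds_ratio(psi, p2):
    """Return p1 such that [p1/(1 - p1)] / [p2/(1 - p2)] equals psi."""
    odds1 = psi * p2 / (1 - p2)
    return odds1 / (1 + odds1)


if __name__ == "__main__":
    for p2 in (0.2, 0.7):
        p1 = p1_for_odds_ratio(2.5, p2)
        # Same odds ratio, different (p1, p2): the printed distributions agree.
        print(p2, conditional_pmf(6, 5, 4, p1, p2))
```

The conditional probabilities are proportional to C(n_1, k) C(n_2, x. − k) ψ^k, so
the agreement is exact apart from rounding.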

Suppose now that the data consist of a sample of n observations from a
multivariate normal distribution, whose parameters are assumed to vary
unrestrictedly. Inference on the matrix ρ of correlation coefficients may be performed
separately, from its empirical counterpart r, without loss of information. In fact,
the marginal distribution of r depends on ρ only and r is obtainable by sufficient
reduction in two steps, first reducing to the set of empirical means, variances, and
covariances by B-sufficiency and then to r by G- (or M-) sufficiency.

Separation of a submodel for inference on a parameter of interest ψ may
involve a sequence of applications of various of the ancillarity and sufficiency 
definitions (the latter example shows a simple case of this), and occasionally it is 
necessary to invoke even more general concepts of nonformation than those 
which have a decomposition that corresponds to such a sequence (see Section 
4.7(iv)). 

Sometimes different separation procedures, relating to one and the same 
interest parameter ψ and justifiable on grounds of nonformation, are applicable
and it can happen that they do not lead ultimately to the same submodel. This 
lack of uniqueness is exemplified in Section 4.7(vi). (The remark on non- 
uniqueness in statistical conclusions made in Section 1.1 is relevant here.) Certain 
uniqueness results will be mentioned at the end of Sections 4.4 and 4.5. 

A striking illustration of the difference it may make whether the inference is
performed separately or not is furnished by the problem of estimating the
standard error σ attaching to a measuring instrument, from duplicate measurements
of n different items. Assuming independent normal variation of the
measurements x_ij, i = 1, …, n, j = 1, 2, the mean value of x_ij being μ_i, one finds
that the maximum likelihood estimate of σ² in this original model is s²/2 where
s² = n⁻¹ Σ_i Σ_j (x_ij − x̄_i.)². Thus the estimator is not consistent, tending to σ²/2 as
n → ∞. In contrast, the maximum likelihood estimate from the conditional
model given x̄*. = (x̄_1., …, x̄_n.) is the usual estimate s². (The mean vector x̄*. is G-
(and also M-) ancillary with respect to σ².)
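
The inconsistency is easy to reproduce by simulation. The Python sketch below is
an illustration under assumptions introduced here: the item means are drawn
arbitrarily, and the function names are hypothetical. It computes both estimates
of the variance from n simulated pairs of measurements.

```python
import random


def simulate_pairs(n, sigma, seed=0):
    """Duplicate measurements of n items, each item with its own mean."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        mu_i = rng.uniform(-50.0, 50.0)   # arbitrary incidental parameter
        pairs.append((rng.gauss(mu_i, sigma), rng.gauss(mu_i, sigma)))
    return pairs


def s_squared(pairs):
    """n^{-1} times the sum over items and duplicates of squared deviations
    from the item mean."""
    total = 0.0
    for a, b in pairs:
        m = (a + b) / 2
        total += (a - m) ** 2 + (b - m) ** 2
    return total / len(pairs)


def full_model_mle(pairs):
    """Maximum likelihood estimate of the variance in the original model."""
    return s_squared(pairs) / 2


def conditional_mle(pairs):
    """Maximum likelihood estimate from the model given the item means."""
    return s_squared(pairs)


if __name__ == "__main__":
    pairs = simulate_pairs(n=20000, sigma=2.0)
    print("true variance  :", 4.0)
    print("full model     :", full_model_mle(pairs))
    print("conditional    :", conditional_mle(pairs))
```

With many items the first estimate settles near half the true variance while the
second settles near the true variance, however the item means are chosen.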

The kind of breakdown of the direct method of maximum likelihood exhibited 
here is a common phenomenon in the class of cases where the number of 
incidental parameters tends to infinity with the number of observations. But in 
many important instances, primarily with exponential or partly exponential 
models, the situation can be remedied, in analogy with the above, by separation
through conditioning. The consistency and asymptotic normality of the con- 
ditional maximum likelihood estimator in such cases have been shown by 
Andersen (1973). Most of his results do not presuppose nonformativeness of the 
conditioning statistic, so that in general a loss of information (and, hence, 
efficiency) may be involved. However, such a loss may well be negligible, or at 
least acceptable, in comparison with the advantages gained by the separation. 


4.2 B-SUFFICIENCY AND B-ANCILLARITY 

The main parts of the mathematical theory of the classical concepts of 
sufficiency and ancillarity — here called B-sufficiency and B-ancillarity — are 
presented in this section. 

A statistic T on a statistical field (𝔛, 𝔄, 𝔓) is B-sufficient if the members of 𝔓 have
a common conditional probability measure given T. In other words, T is B-sufficient
provided there exists a Markov kernel M(·; ·) on 𝔄 × 𝔛 such that for each A ∈ 𝔄
the function M(A; ·) is measurable with respect to σ(T), the σ-algebra generated
by T, and such that

\[ P(A \cap B) = \int_B M(A; \cdot)\, dP \]

for each P ∈ 𝔓, B ∈ σ(T). A minimal B-sufficient statistic is a B-sufficient statistic T
such that if T̃ is any other B-sufficient statistic then σ(T) ⊂ σ(T̃) ∨ 𝔑, where 𝔑
denotes the σ-algebra generated by those sets A ∈ 𝔄 for which P(A) = 0 for every
P ∈ 𝔓.

If the probability measures in 𝔓 are mutually absolutely continuous let P_0 be
an element of 𝔓, and generally take P_0 to be of the form P_0 = Σ_n c_n P_n with
P_n ∈ 𝔓, c_n > 0 (n = 1, 2, …) and Σ_n c_n = 1, such that a set A ∈ 𝔄 belongs
to 𝔑 if and only if P_0(A) = 0. (The existence of such a P_0 is well known and was
established by Halmos and Savage (1949).)

In the following, when the symbol [𝔓] is put after some relation this indicates
that the relation holds up to set(s) of P-measure 0 for every P ∈ 𝔓.

Theorem 4.1. A statistic T is B-sufficient if and only if for each P ∈ 𝔓 there exists a
T-measurable version of dP/dP_0.

Proof. It causes no loss of generality to assume that the elements of 𝔓 are
mutually absolutely continuous. (To see this, consider the family
{½(P + P_0): P ∈ 𝔓}, which has this property.)

Suppose T is B-sufficient, let M be the Markov kernel for the common
conditional distribution given T and, for a fixed P ∈ 𝔓, set

\[ U(x) = \int \frac{dP}{dP_0}(\tilde{x})\, M(d\tilde{x}; x), \quad x \in \mathfrak{X}. \]

It is obvious that U is T-measurable. Furthermore, U is a version of dP/dP_0
because, with E_0^T denoting conditional mean value given T under P_0,

\[ U = E_0^T\, \frac{dP}{dP_0} \quad [P_0] \]

and hence, for every A ∈ 𝔄,

\[ \int_A U\, dP_0 = \int \frac{dP}{dP_0}\, M(A; \cdot)\, dP_0 = \int M(A; \cdot)\, dP = P(A). \]

Conversely, let dP/dP_0 be T-measurable and define M(·; ·) as the Markov kernel
for the conditional distribution given T under P_0. Then for every A ∈ 𝔄, B ∈ σ(T)

\[ \int_B M(A; \cdot)\, dP = \int 1_B\, E_0^T(1_A)\, \frac{dP}{dP_0}\, dP_0 = P(A \cap B) \]

showing that M(·; ·) is also the Markov kernel for the conditional distribution
under P. ►

For the proof of the next theorem and its corollary the following four lemmas
will be needed. In these lemmas the members of 𝔓 are presupposed to be mutually
absolutely continuous, A denotes an element of 𝔄 and 𝔅 a sub-σ-algebra of 𝔄.
(As before, 𝔑 is the σ-algebra generated by the 𝔓-null sets, and E_0^𝔅 denotes
conditional mean value given 𝔅 under P_0.)

Lemma 4.1. One has

\[ \mathfrak{B} \vee \mathfrak{N} = \{A : 1_A = E_0^{\mathfrak{B}} 1_A \ [P_0]\}. \]

Lemma 4.2. Let Y be an integrable stochastic variable and suppose that
σ(Y) ⊂ 𝔅 ∨ 𝔑. Then E_0^𝔅 Y = Y [P_0].

Proof. This may be verified by showing that E_0^𝔅 Y is a version of E_0^{𝔅 ∨ 𝔑} Y, which is
straightforward to do, using Lemma 4.1. ►

The σ-algebra 𝔅 is said to be separable if there exist sets B_i ∈ 𝔅, i ∈ N, such that
𝔅 = σ(1_{B_i}: i ∈ N).

Lemma 4.3. 𝔅 is separable if and only if it is generated by a stochastic variable.

Proof. The if assertion is trivial.

Suppose 𝔅 = σ(1_{A_i}: i ∈ N) and define Y by

\[ Y = \sum_{i=1}^{\infty} 3^{-i}\, 1_{A_i}. \]

It is enough to prove that any set of the form B_1 ∩ B_2 ∩ … ∩ B_n where B_i equals
either A_i or A_i^c (n = 1, 2, …; i = 1, 2, …, n) is contained in σ(Y). Set

\[ e_i = \begin{cases} 0 & \text{if } B_i = A_i^c \\ 1 & \text{if } B_i = A_i \end{cases} \]

and

\[ b = \sum_{i=1}^{n} 3^{-i} e_i. \]

Then B_1 ∩ B_2 ∩ … ∩ B_n = {b ≤ Y < b + 3^{-n}}. ►
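
The device used in this proof is a ternary encoding: the membership pattern of a
point in the generating sets is recorded in the base-3 digits of a single number,
and an initial segment of the pattern is read off from an interval condition.
The Python sketch below checks this for finitely many generating sets, using
exact rational arithmetic. It is an illustration only; the truncation to N sets
and all function names are assumptions introduced here.

```python
from fractions import Fraction
from itertools import product


def encode(digits):
    """Sum of 3^{-i} e_i over a finite 0/1 membership pattern (e_1, ..., e_N)."""
    return sum(Fraction(e, 3 ** i) for i, e in enumerate(digits, start=1))


def in_interval(y, prefix):
    """Test b <= y < b + 3^{-n}, where b encodes the pattern `prefix` of length n."""
    n = len(prefix)
    b = encode(prefix)
    return b <= y < b + Fraction(1, 3 ** n)


def check(N, n):
    """For every pattern of length N, the interval test must hold for exactly
    the prefix formed by the first n digits of that pattern."""
    for digits in product((0, 1), repeat=N):
        y = encode(digits)
        for prefix in product((0, 1), repeat=n):
            if in_interval(y, prefix) != (digits[:n] == prefix):
                return False
    return True


if __name__ == "__main__":
    print(check(N=6, n=3))
```

Base 3 rather than base 2 is what makes the intervals for different prefixes
disjoint: with digits 0 and 1 only, the tail of the sum is at most half of
3^{-n}.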


Lemma 4.4. For any 𝔅 there exists a separable σ-algebra 𝔅_0 such that
𝔅_0 ⊂ 𝔅 ⊂ 𝔅_0 ∨ 𝔑.

Proof. The σ-algebra 𝔄 is separable and hence 𝔄 = σ(1_{A_i}: i ∈ N) for some family
{A_i: i ∈ N}. Define 𝔅_0 by

\[ \mathfrak{B}_0 = \sigma(E_0^{\mathfrak{B}} 1_{A_i} : i \in N). \]

This σ-algebra is separable since it is generated by a countable set of stochastic
variables, and clearly 𝔅_0 ⊂ 𝔅. Furthermore,

\[ \mathfrak{A} = \{A : \text{there exists a } \mathfrak{B}_0\text{-measurable version of } E_0^{\mathfrak{B}} 1_A\} \]

from which follows that 1_B = E_0^{𝔅_0} 1_B [P_0] for every B ∈ 𝔅. Consequently,
by Lemma 4.1, 𝔅 ⊂ 𝔅_0 ∨ 𝔑. ►

Theorem 4.2. A statistic T is minimal B-sufficient if and only if
σ{T} = σ{dP/dP_0: P ∈ 𝔓} [𝔓].

Proof. As in the proof of Theorem 4.1, one may assume mutual absolute
continuity of the members of 𝔓.
Set 𝔅_0 = σ{T} and 𝔅 = σ{dP/dP_0: P ∈ 𝔓}.

If T is minimal B-sufficient then, by Theorem 4.1, 𝔅 ⊂ 𝔅_0 [𝔓]. Let T̃ be a
statistic such that σ(T̃) ⊂ 𝔅 ⊂ σ(T̃) ∨ 𝔑. The existence of such a statistic is
apparent from Lemmas 4.3 and 4.4. Applying Lemma 4.2 one obtains

\[ \frac{dP}{dP_0} = E_0^{\tilde{T}}\, \frac{dP}{dP_0} \quad [P_0] \]

and this, again by Theorem 4.1, shows that T̃ is B-sufficient, and the minimal
B-sufficiency of T then yields 𝔅_0 ⊂ 𝔅 [𝔓].

On the other hand, suppose 𝔅_0 = 𝔅 [𝔓]. From Lemma 4.2 it follows that
E_0^T(dP/dP_0) is a version of dP/dP_0 for every P ∈ 𝔓, whence, using Theorem 4.1, T is
B-sufficient. If T̃ is any B-sufficient statistic then 𝔅 ⊂ σ(T̃) [𝔓] which implies
𝔅_0 ⊂ σ(T̃) [𝔓], as required for minimal sufficiency of T. ►

Corollary 4.1. A minimal B-sufficient statistic exists. 

Proof. Invoke Theorem 4.2 and Lemmas 4.3 and 4.4. ► 

With 𝔓 parametrized, 𝔓 = {P_ω: ω ∈ Ω}, set

\[ q(x; \omega) = \frac{dP_\omega}{dP_0}(x) \]

and let r be the mapping which maps a point x ∈ 𝔛 to the likelihood function

\[ r(x) = q(x; \cdot). \]

Furthermore, let the range space of r be endowed with the product σ-algebra
ℬ^Ω, where ℬ is the Borel σ-algebra in R. Then r is measurable and

\[ \sigma\{r\} = \sigma\{q(\cdot\,; \omega) : \omega \in \Omega\}. \tag{1} \]

(Here, and until the end of Example 4.2, equalities, inclusions, etc. are strict, i.e.
not modulo null sets.) This proposition represents one precise interpretation of
the common phrase ‘the likelihood function is minimal sufficient’ (cf. Theorem 4.2
and Corollary 4.1). However, rather than this interpretation, the phrase reflects
the useful fact that if t is a statistic generating the same partition of 𝔛 as the
mapping r, i.e.

\[ t(x) = t(\tilde{x}) \iff q(x; \omega) = q(\tilde{x}; \omega) \text{ for every } \omega \in \Omega, \tag{2} \]

then, as a rule, t is minimal B-sufficient. That some regularity condition is needed
to ensure the minimal sufficiency of such a statistic t is illustrated by:

Example 4.1. Let x_1 and x_2 be independent and normally distributed,
x_1 ~ N(0, 1), x_2 ~ N(ω, 1) with ω ∈ Ω = R. Then x_2 is minimal sufficient with
respect to the family 𝔓 = {P_ω: ω ∈ Ω} of joint distributions of x = (x_1, x_2).

With 𝔛 = R² and P_0 taken as the member of 𝔓 with ω = 0, a version of dP_ω/dP_0
is given by

\[ q(x; \omega) = [1 - \delta(x_1 - \omega)] \exp(\omega x_2 - \tfrac{1}{2}\omega^2) \]

where δ is the function on R which is 1 at the origin and 0 otherwise. Clearly,

\[ q(x; \cdot) = q(x'; \cdot) \iff x = x' \]

which means that x generates the same partition of 𝔛 as does r: x → q(x; ·),
although x is not minimal sufficient. ►

Theorem 4.3. Suppose t is a statistic which generates the same partition of 𝔛 as the
mapping r: x → q(x; ·).
If either 𝔓 is discrete or q(x; ω) is continuous in ω for each fixed x then t is
minimal sufficient.

Remark. The discrete case is straightforward to verify. If, in a given case, the 
conditions of the theorem are not fulfilled it may well be that minimal sufficiency 
can be established by the method of proof below which, as should be apparent, 
offers scope for considerable generalizations. 

Proof. First some general results on mappings and σ-algebras determined by the
mappings will be stated.

For any mapping f on an arbitrary measurable space (E, 𝔈), let δ(f) denote the
partition σ-algebra determined by f, i.e. the σ-algebra of those sets C ∈ 𝔈 which are
unions of elements of the partition of E generated by f. Clearly,
δ(f) = {C ∈ 𝔈: C = f⁻¹(f(C))} and, if f is a measurable mapping,

\[ \sigma(f) \subset \delta(f). \tag{3} \]

Under mild regularity conditions one has, in fact, that σ(f) = δ(f). This will be
seen from the following proposition which is a special case of Theorem 3, p. 145, in
Hoffmann-Jorgensen (1970).

Lemma 4.5. Let E, F and G be Borel subsets of complete, separable metric spaces,
endowed with the Borel σ-algebras. Let f and g be measurable mappings from E into
F and G, respectively, and suppose that the partition of E generated by f is finer than
the partition generated by g. Then there exists a measurable mapping h from F into
G such that g = h ∘ f.

Taking g to be the indicator function of an element of δ(f) one finds that this
element belongs to σ(f). Hence one has:

Corollary 4.2. Suppose f is a measurable mapping from a Borel subset of a
complete, separable metric space into a complete, separable metric space. Then
σ(f) = δ(f).

It follows that, always,

\[ \sigma(t) = \delta(t) = \delta(r) \tag{4} \]

(and hence, in view of (1) and (3), that t is sufficient).

Now, suppose that q(x; ·) is continuous on Ω for every x ∈ 𝔛. Let Ω_0 be a countable,
dense subset of Ω and let r_0 be the mapping on 𝔛 such that r_0(x) is the restriction of
q(x; ·) to Ω_0. Then, by the continuity, σ(r) = σ(r_0) and r_0 determines the same
partition of 𝔛 as r (and t). Therefore, on account of (4) and Corollary 4.2,

\[ \sigma(t) = \delta(r_0) = \sigma(r_0) = \sigma(r) \]

and since σ(r) is minimal sufficient so is t. Thus Theorem 4.3 is verified. ►
Corollary 4.3. Suppose that t is a statistic which generates the same partition of 𝔛
as the likelihood function p(x; ·), i.e.

\[ t(x) = t(\tilde{x}) \iff c\,p(x; \omega) = \tilde{c}\,p(\tilde{x}; \omega) \text{ for every } \omega \in \Omega \]

for some positive c and c̃ which do not depend on ω but may depend on x and x̃,
respectively.

If either 𝔓 is discrete or p(x; ω) is positive and continuous in ω for each fixed x
then t is minimal sufficient.

Proof. P_0 is of the form Σ_n c_n P_{ω_n}. Taking the version of dP_0/dμ given by

\[ p_0(x) = \sum_{n=1}^{\infty} c_n\, p(x; \omega_n) \]

and setting q(x; ω) = p(x; ω)/p_0(x) one obtains that Theorem 4.3 applies. ►
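
For a discrete family the partition generated by proportional likelihood functions
can be computed by brute force. The Python sketch below does this for sequences of
independent Bernoulli trials; the finite parameter set, the sample size and the
function names are assumptions introduced here for illustration. Sample points are
grouped by their likelihood function normalized to sum 1, so that two points fall
in the same group exactly when their likelihood functions are proportional.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite parameter set for the success probability.
OMEGAS = (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4))


def likelihood(x, omega):
    """Probability of the 0/1 sequence x under independent Bernoulli(omega) trials."""
    s = sum(x)
    return omega ** s * (1 - omega) ** (len(x) - s)


def normalized_likelihood(x):
    """Likelihood function on OMEGAS, rescaled so proportional functions coincide."""
    values = [likelihood(x, w) for w in OMEGAS]
    total = sum(values)
    return tuple(v / total for v in values)


def likelihood_partition(n):
    """Group all 0/1 sequences of length n by normalized likelihood function."""
    groups = {}
    for x in product((0, 1), repeat=n):
        groups.setdefault(normalized_likelihood(x), []).append(x)
    return list(groups.values())


if __name__ == "__main__":
    for group in likelihood_partition(3):
        print(sorted({sum(x) for x in group}), group)
```

Each group consists of the sequences with a given number of successes, so the
number of successes generates the same partition as the likelihood function, in
line with its being minimal sufficient for this family.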

Example 4.2. The model function for a sample x_1, …, x_n from the Cauchy
distribution with mode ω is

\[ p(x; \omega) = \pi^{-n} \prod_{i=1}^{n} \frac{1}{1 + (x_i - \omega)^2}. \]

If x = (x_1, …, x_n) and x̃ = (x̃_1, …, x̃_n) satisfy c p(x; ·) = c̃ p(x̃; ·) then

\[ \tilde{c} \prod_{i=1}^{n} \bigl(1 + (x_i - \omega)^2\bigr) = c \prod_{i=1}^{n} \bigl(1 + (\tilde{x}_i - \omega)^2\bigr). \]

Both sides of this equation are polynomials in ω and hence the equality holds for
all ω ∈ R precisely when these two polynomials have the same roots. Since the
roots are x_j ± i, j = 1, …, n, respectively x̃_j ± i, j = 1, …, n, one sees that the
order statistic (x_(1), …, x_(n)) is minimal B-sufficient. ►
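
The criterion used in this example, proportionality of the likelihood functions of
two samples, can be probed numerically. The Python sketch below is an illustration
under assumptions introduced here (function names, sample values and the grid of
parameter values are hypothetical). It evaluates the ratio of two Cauchy
likelihoods on a grid and reports whether the ratio is constant there.

```python
import math


def cauchy_likelihood(xs, omega):
    """Joint density of the sample xs under the Cauchy distribution with mode omega."""
    n = len(xs)
    return math.pi ** (-n) / math.prod(1 + (x - omega) ** 2 for x in xs)


def ratio_is_constant(xs, ys, omegas, tol=1e-9):
    """Check whether the likelihood ratio of xs to ys is constant over omegas."""
    ratios = [cauchy_likelihood(xs, w) / cauchy_likelihood(ys, w) for w in omegas]
    return max(ratios) - min(ratios) <= tol * max(ratios)


if __name__ == "__main__":
    grid = [-3.0, -1.0, 0.0, 0.5, 2.0, 4.0]
    sample = [0.3, -1.2, 2.0]
    print(ratio_is_constant(sample, [2.0, 0.3, -1.2], grid))  # same order statistic
    print(ratio_is_constant(sample, [0.3, -1.2, 2.5], grid))  # different order statistic
```

A finite grid can only refute proportionality, not establish it; the polynomial
argument in the example is what shows that equal order statistics are necessary
as well as sufficient.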

We now turn to a discussion of B-ancillary statistics, i.e. statistics t such that
the marginal distribution P_t of t does not depend on P ∈ 𝔓, and of relations
between B-ancillarity and B-sufficiency. (For most of this discussion the
presupposition that the family 𝔓 is dominated is not needed.)

Only seldom does it happen that a B-ancillary statistic t exists which is maximal
in the sense that if t̃ is any other B-ancillary statistic then σ(t̃) ⊂ σ(t) [𝔓]. Another
way of expressing this is that if t_1 and t_2 are B-ancillary then t = (t_1, t_2) will, in
general, not be B-ancillary.

A B-ancillary statistic t will be called relatively maximal if for any B-ancillary
statistic t̃ with σ(t) ⊂ σ(t̃) [𝔓] one has, in fact, σ(t̃) = σ(t) [𝔓].


Example 4.3. Let (x_i, y_i), i = 1, …, n, be independent, two-dimensionally normally
distributed random variables with E(x_i) = E(y_i) = 0, V(x_i) = V(y_i) = 1,
V(x_i, y_i) = ρ where ρ varies in (−1, 1). Then (x_1, …, x_n) and (y_1, …, y_n) are both
B-ancillary but together they constitute the whole sample. ►


The terms ‘unique maximal’ and ‘maximal’ are commonly used to designate what is
in this book called maximal and relatively maximal, respectively.
Example 4.4. Consider two gene loci each having two allelic genes denoted,
respectively, by A, a and B, b. Both A and B are assumed dominant. Suppose one
has observed the phenotypes of n individuals sampled randomly among the
offspring of a population consisting entirely of double heterozygotes of trans type
(chromosomal arrangement Ab/aB), the observations being set out as in the table
below.

           A-      aa
    B-     x_11    x_12    x_1.
    bb     x_21    x_22    x_2.
           x_.1    x_.2    n

On the assumptions of random union of gametes and no selection, the
corresponding table of probabilities is

           A-           aa
    B-     (2 + π)/4    (1 − π)/4    3/4
    bb     (1 − π)/4    π/4          1/4
           3/4          1/4          1

where the parameter π is the product of the recombination probabilities for males
and females.

x_1. and x_.1 are obviously both B-ancillary, but jointly they are not B-ancillary.
This is apparent for instance from the fact that the probability that both x_1. and
x_.1 are 0 equals (π/4)^n. ►
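
The two claims of this example can be checked directly from the cell probabilities.
The Python sketch below is a minimal illustration; the function names and the
numerical values of the parameter are assumptions introduced here. It computes the
marginal distributions of the row total and the column total, which do not involve
the parameter, and the probability that both totals vanish, which does.

```python
from math import comb


def cell_probs(pi):
    """Cell probabilities in the order (B-,A-), (B-,aa), (bb,A-), (bb,aa)."""
    return ((2 + pi) / 4, (1 - pi) / 4, (1 - pi) / 4, pi / 4)


def row_total_pmf(k, n, pi):
    """P(x_1. = k): number of the n individuals with phenotype B-."""
    p = cell_probs(pi)
    row = p[0] + p[1]
    return comb(n, k) * row ** k * (1 - row) ** (n - k)


def column_total_pmf(k, n, pi):
    """P(x_.1 = k): number of the n individuals with phenotype A-."""
    p = cell_probs(pi)
    col = p[0] + p[2]
    return comb(n, k) * col ** k * (1 - col) ** (n - k)


def both_totals_zero(n, pi):
    """P(x_1. = 0 and x_.1 = 0): all n individuals fall in the (bb, aa) cell."""
    return cell_probs(pi)[3] ** n


if __name__ == "__main__":
    for pi in (0.1, 0.8):
        print(pi, row_total_pmf(2, 5, pi), column_total_pmf(2, 5, pi),
              both_totals_zero(5, pi))
```

The first two printed columns are the same for both parameter values, while the
last column changes, so the pair of totals cannot be B-ancillary.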

In Examples 4.3 and 4.4 the observations do not constitute a minimal 
B-sufficient statistic. This, however, is the case in the next example. 

Example 4.5. Consider a two-by-two contingency table

    x_11    x_12
    x_21    x_22

with the total fixed, and cell probabilities

    (1 + π)/6    (2 − π)/6
    (1 − π)/6    (2 + π)/6

with π varying in (−1, 1). Here (x_11, x_12, x_21, x_22) is minimal B-sufficient while x_1.
and x_.1 are each relatively maximal B-ancillary. Thus a maximal B-ancillary
statistic does not exist. ►

Examples of this kind were first given by Birnbaum (1961, 1962) and Basu 
(1964). 

At a number of places in the following it will be convenient to use the
notion of (bounded) completeness. The family 𝔓 is called (boundedly) complete if
for every (bounded) real-valued function f on 𝔛 which satisfies

\[ \int f\, dP = 0, \quad P \in \mathfrak{P}, \]

one has

\[ f = 0 \quad [\mathfrak{P}]. \]

This notion does not seem to have any significant statistical meaning, but
completeness or bounded completeness implies various properties of considerable
statistical interest.

The concepts of B-sufficiency and B-ancillarity are both special cases of
conditional B-ancillarity. A statistic u is called conditionally B-ancillary given the
statistic w if the members of 𝔓 have a common conditional distribution of u given
w. B-sufficiency obtains for u = x, while B-ancillarity corresponds to w being a
constant.

Suppose u is conditionally B-ancillary given w, let A be a u-measurable event
and let M(A; ·) be the common conditional probability of A given w. Clearly

\[ \int \bigl(1_A - M(A; \cdot)\bigr)\, dP = 0, \quad P \in \mathfrak{P}. \]

Hence, if 𝔓 is boundedly complete

\[ M(A; \cdot) = 1_A \quad [\mathfrak{P}]. \]

This conclusion may be paraphrased as follows.

Theorem 4.4. If 𝔓 is boundedly complete then there are no nontrivial instances of
conditionally B-ancillary statistics.

As one would expect, if there are no nontrivial conditionally B-ancillary
statistics then, in particular, x is minimal B-sufficient. To prove this, suppose t is
B-sufficient, let A be any event and set B = {M(A; ·) = 1} where M(·; ·) is the
common Markov kernel. By assumption, M(A; ·) = 1_A [𝔓] and consequently
1_B = 1_A [𝔓], as was to be verified.

Let t and u be statistics, assume that t is B-sufficient, and let M denote the
Markov kernel for the common conditional distribution given t. In the case t and
u are independent under a P ∈ 𝔓 then for any C ∈ σ(u)

\[ P(C) = P(C \mid t) = M(C; \cdot) \quad [\mathfrak{P}]. \]

This implies that u is B-ancillary.

A converse assertion, due to Basu (1955), holds when the family of marginal
distributions of t is boundedly complete. This is stated below as a corollary to:

Theorem 4.5. Let t, u, and w be statistics, suppose that (t, w) is B-sufficient and that u
is conditionally B-ancillary given w. If the family of marginal distributions of (t, w)
is boundedly complete then t and u are conditionally independent given w.

Proof. Let M and M_0 be the Markov kernels of the common conditional
distributions of, respectively, x given (t, w) and u given w. Then for every C ∈ σ(u) and
P ∈ 𝔓 one has

\[ \int \bigl(M(C; \cdot) - M_0(C; \cdot)\bigr)\, dP = 0 \]

whence, by bounded completeness,

\[ M(C; \cdot) = M_0(C; \cdot) \quad [\mathfrak{P}] \]

which means that the conditional distribution of u is the same given w as given
(t, w). ►

Corollary 4.4. Suppose t is B-sufficient and u is B-ancillary. If the family of
marginal distributions of t is boundedly complete then t and u are independent.

Illustrations of Theorem 4.5 and its corollary — the latter is known as Basu’s 
Theorem (Basu, 1955, 1958) — will be presented in Section 8.1, after a general 
sufficient condition for completeness of exponential families has been established. 
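
Pending those illustrations, a standard textbook instance of Basu's Theorem, not
taken from this section, may serve as a numerical sanity check: for a sample from
N(μ, 1) with μ unknown the sample mean is B-sufficient with a complete family of
distributions, and the sample variance is B-ancillary, so the two are independent.
The Python sketch below, whose function names and simulation settings are
assumptions introduced here, estimates their correlation by simulation. A
correlation near zero is of course only a necessary consequence of independence,
not a proof of it.

```python
import math
import random


def mean_and_variance(sample):
    """Sample mean and unbiased sample variance."""
    n = len(sample)
    m = sum(sample) / n
    s2 = sum((x - m) ** 2 for x in sample) / (n - 1)
    return m, s2


def simulate(mu, n, reps, seed=0):
    """Repeat the experiment reps times; return (mean, variance) pairs."""
    rng = random.Random(seed)
    return [mean_and_variance([rng.gauss(mu, 1.0) for _ in range(n)])
            for _ in range(reps)]


def correlation(pairs):
    """Empirical correlation coefficient of a list of pairs."""
    k = len(pairs)
    ma = sum(a for a, _ in pairs) / k
    mb = sum(b for _, b in pairs) / k
    cov = sum((a - ma) * (b - mb) for a, b in pairs) / k
    va = sum((a - ma) ** 2 for a, _ in pairs) / k
    vb = sum((b - mb) ** 2 for _, b in pairs) / k
    return cov / math.sqrt(va * vb)


if __name__ == "__main__":
    for mu in (0.0, 7.0):
        pairs = simulate(mu, n=5, reps=20000)
        mean_s2 = sum(s2 for _, s2 in pairs) / len(pairs)
        print(mu, "correlation:", correlation(pairs), "mean of s2:", mean_s2)
```

The average of the sample variance stays near 1 whatever the value of μ, which
reflects its B-ancillarity, and the estimated correlation with the sample mean is
close to zero.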


4.3 NONFORMATION

As mentioned in Section 4.1, a key concept of inferential separation is nonfor- 
mation, i.e. the concept of a submodel containing no information with respect to a 
specified parameter function. Four mathematical definitions of nonformation, 
namely B-, S-, G-, and M-nonformation, are given in the present section. These 
specify circumstances under which, whatever the outcome of the experiment, no 
information on the parameter function is contained in the conjunction of the 
submodel and the part of the data with which the submodel is concerned. More 
generally, one may have situations where this conjunction is nonformative for 
certain experimental outcomes but not for others. It is possible to extend the 
definitions of B-, S-, and M-nonformation to include such cases of pointwise 
nonformation. 

Let u and v be statistics and let $\psi$ be a parameter function. Then u, v, and the 
submodel $\{P_{\omega,u}(\cdot|v) : \omega \in \Omega\}$, consisting of the conditional distributions of u given 
v, may be nonformative with respect to $\psi$ in one or more of the following senses. 

B-nonformation: For every value of v the conditional distribution $P_{\omega,u}(\cdot|v)$ does not 
depend on $\omega$. ► 

S-nonformation: For every value of v the family $\{P_{\omega,u}(\cdot|v) : \psi(\omega) = \psi\}$ does not 
depend on $\psi$. ► 

G-nonformation: For every value of v and $\psi$ the family $\{P_{\omega,u}(\cdot|v) : \psi(\omega) = \psi\}$ is a 
union of families each of which is generated by a transitive group of trans- 
formations. ► 

M-nonformation: For every value of v and $\psi$ the family $\{P_{\omega,u}(\cdot|v) : \psi(\omega) = \psi\}$ is 
universal. ► 

(In the definitions of G- and M-nonformation the requirements of transitivity 
respectively universality refer to the set of all realizable values of u under the 
whole family, not just the family $\{P_{\omega,u}(\cdot|v) : \psi(\omega) = \psi\}$.) 

Each of these four definitions satisfies the natural requirement that nonfor- 
mation with respect to $\psi$ implies nonformation with respect to any parameter 
function which depends on $\omega$ through $\psi$ only. 

Obviously, B-nonformation implies S-nonformation. Moreover, G-non- 
formation does, as a rule, entail M-nonformation, cf. Lemma 2.1. 

The above four definitions are global in that they state that irrespective of 
which values of u and v are realized no information on $\psi$ can be extracted from 
these values plus the submodel in question. However, in general, different 
realizations carry different amounts of evidence and it is reasonable to ask for 
pointwise versions of these definitions, concerned with whether any particular 
observed values u and v are nonformative. G-nonformation does not seem to lend 
itself naturally to such an extension, but the other three notions do. It is natural to 
formulate the pointwise definitions in terms of the conditional probability 
function of u given v, and for notational convenience this function will be 
indicated by $p(\cdot;\omega|v)$ (rather than $p_u(\cdot;\omega|v)$). 

Pointwise B-nonformation: For the observed values u and v, $p(u;\omega|v)$ does not 
depend on $\omega$. ► 

Pointwise S-nonformation: For the observed values u and v the family 
$\{p(u;\omega|v) : \psi(\omega) = \psi\}$ does not depend on $\psi$. ► 

Pointwise M-nonformation: For the observed values u and v, and for every value 
$\psi$, the family $\{p(\cdot;\omega|v) : \psi(\omega) = \psi\}$ has u as mode point, and the com- 
plementary family $\{p(\cdot;\omega|v) : \psi(\omega) \neq \psi\}$ is universal. ► 



Clearly, pointwise B-, S-, and M-nonformation for all u and v entails, 
respectively, B-, S-, and M-nonformation. 

The reasoning behind the definition of pointwise M-nonformation, and hence 
behind M-nonformation, is the following. If, under the conditions specified in the 
definition, the observation u and the submodel $\{p(\cdot;\omega|v) : \omega \in \Omega\}$ did contain 
available information on $\psi$ then it would be possible on the basis of their 
conjunction alone to say that some value $\psi_0$ of $\psi$ is less credible than some other 
value. But this is not warranted because 

(a) there is perfect fit (or complete concordance) between u and the model 
$\{p(\cdot;\omega|v) : \psi(\omega) = \psi_0\}$ (in the sense that u is a mode point of the model); 

(b) the alternative model $\{p(\cdot;\omega|v) : \psi(\omega) \neq \psi_0\}$ will fit any value u perfectly; 

and because it is not scientifically reasonable to call a hypothesis (or model) which 
is in complete agreement with the data less plausible than an alternative 
hypothesis which is capable of explaining any of the possible outcomes. 

M-nonformation and pointwise M-nonformation are, clearly, related in spirit 
to the concept of plausibility. 

The main uses of the various notions of nonformation are in the processes of 
conditioning on ancillary statistics or margining to sufficient statistics, and many 
examples of these notions are contained in Sections 4.2 and 4.4 and in Chapter 10. 
Here, then, just two, somewhat special, examples will be given which illustrate 
that a statistic can be pointwise S- or M-nonformative without being globally so. 

Example 4.6. Suppose an individual (or item) is subjected to two kinds of events, 
occurring in two independent Poisson processes with intensities $\lambda$ and $\mu$, 
respectively. The individual responds, e.g. dies (or the item is destroyed), at the 
moment both kinds of events have occurred. For each of n individuals inde- 
pendently, it is recorded which type of event occurred first and how long elapsed 
between that occurrence and the response. Let v be the number of individuals for 
which the first event that happened was from the Poisson process having the 
intensity $\lambda$. Then v follows a binomial distribution with probability parameter 
$\psi = \lambda/(\lambda + \mu)$. Conditionally on v, the set of recorded time intervals is 
S-nonformative at v = n with respect to $\psi$ (on the proviso that $\lambda$ and $\mu$ vary 
independently, both in $(0, \infty)$). (Another way of expressing this would be to say 
that v is S-sufficient at v = n with respect to $\psi$, cf. the definition of S-sufficiency in 
Section 4.4.) ► 

Example 4.7. Consider the 2 x 2 contingency table with one marginal given, as 
discussed in Section 4.1. If $\psi$ is given by equation (2) of that section, then x. is 
M-nonformative with respect to that parameter, as will be shown in Example 10.12. 
Suppose, however, that 

$$\psi = p_1 - p_2.$$

Then x. is not M-nonformative with respect to $\psi$ and it may indeed yield 
information on $\psi$. For instance, if x. is observed equal to 0 then there is reason to 
believe that $|\psi|$ is not very near to 1. On the other hand, for $n_1 = n_2 = n$ the 
realization n of x. is M-nonformative with respect to $\psi$. This follows from the facts 
that every distribution with $p_1 + p_2 = 1$ has n as mode point, which may be 
proved by remarking that any such distribution is symmetric around n and 
unimodal (by Theorem 6.6 it is, in fact, strongly unimodal), and that the subfamily 
of distributions of x. determined by $\psi = 0$, being a binomial family, is 
universal. ► 
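The mode-point claim in Example 4.7 can be checked numerically. The following is a minimal sketch, assuming (as in Section 4.1) that x. is the sum of two independent binomial counts with sample sizes $n_1 = n_2 = n$ and success probabilities $p_1$ and $p_2 = 1 - p_1$; the function names are illustrative only and are not part of the text.

```python
from math import comb


def binom_pmf(n, p):
    """Point probabilities of the Binomial(n, p) distribution, as a list of length n + 1."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def sum_pmf(n1, p1, n2, p2):
    """Point probabilities of x. = x1 + x2 for independent binomial x1 and x2 (plain convolution)."""
    f1, f2 = binom_pmf(n1, p1), binom_pmf(n2, p2)
    out = [0.0] * (n1 + n2 + 1)
    for i, a in enumerate(f1):
        for j, b in enumerate(f2):
            out[i + j] += a * b
    return out


def mode_points(pmf, tol=1e-12):
    """All arguments at which the point probabilities attain their maximum (up to tol)."""
    top = max(pmf)
    return [k for k, v in enumerate(pmf) if v >= top - tol]


# With p1 + p2 = 1 the distribution of x. is symmetric around n and has n as a mode point.
n = 6
for p1 in (0.1, 0.3, 0.5, 0.8):
    pmf = sum_pmf(n, p1, n, 1 - p1)
    assert n in mode_points(pmf)
    assert all(abs(pmf[k] - pmf[2 * n - k]) < 1e-12 for k in range(2 * n + 1))
```

The check covers only a few parameter values and is no substitute for the unimodality argument referred to in the example.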

In general, a statement that something contains no information or evidence 
with respect to a certain question is a relative statement, i.e. it is valid only on the 
proviso that certain other things are unknown (or not taken into consideration). 
Sometimes it is a far-fetched thought that such other things should become 
wholly or partly known, but in other connections it may be an obvious possibility. 
Typically, in the case where one of the definitions of (pointwise) S-nonformation, 
G-nonformation, or (pointwise) M-nonformation is satisfied it will nevertheless 
be possible to extract information on $\psi$ from u, v, and the submodel if to these can 
be added further evidence on $\omega$, available for instance from some other, 
independent experiment or perhaps even from another part of the same model. It 
is particularly important to stress this in relation to G- and M-nonformation. For 
these, in contrast to S-nonformation, two independent observations of (u, v), 
conjoined with the submodel for u given v, do, as a rule, provide evidence on $\psi$. 
The difference between S-nonformation on the one hand and G- or 
M-nonformation on the other indicated here may be further illuminated by 
consideration of a situation where only one observation of (u, v) is at hand but 
where one gets to know that $\omega$ belongs to a subset $\Omega_0$ of $\Omega$ such that for each value 
$\psi$ there is precisely one $\omega \in \Omega_0$ with $\psi(\omega) = \psi$. This extra knowledge would clearly 
be of no help in the S-nonformation case, but would normally be relevant under 
G- or M-nonformation and could, for certain models, even show definitively 
which value of $\psi$ is the correct one. For the often encountered cases where $\omega$ is of 
the form $(\chi, \psi)$ with $\chi$ and $\psi$ variation independent, these circumstances may be 
briefly summarized and stressed by saying that G- or M-nonformation means no 
available information on $\psi$ in the absence of knowledge on $\chi$. 


4.4 S-, G-, AND M-ANCILLARITY AND -SUFFICIENCY 

The general ideas of ancillarity and sufficiency were discussed in Section 4.1. In 
order to obtain a fully specified concept of ancillarity or sufficiency one must give 
precise definition to the phrase that a submodel contains no information with 
respect to a specified parameter function. Various such definitions have been 
presented in the previous section. The most elemental of these definitions is that 
of B-nonformation which corresponds to B-ancillarity and B-sufficiency, i.e. 
classical ancillarity and sufficiency. The properties of B-ancillary and B-sufficient 
statistics have, for practical reasons, been treated already in Section 4.2, and the 
present section is devoted mainly to illustrations of the ancillarity and sufficiency 
concepts derived from, respectively, S-, G-, and M-nonformation. 

Let t be a statistic and $\psi$ a subparameter, and consider the factorization 

(1) $p(x;\omega) = p(t;\omega)\,p(x;\omega|t)$ 

of the probability function for x into the marginal density for t and the 
conditional density given t. 

Assuming that $\psi$ parametrizes the conditional distributions, (1) may be written 

(2) $p(x;\omega) = p(t;\omega)\,p(x;\psi|t)$. 

In this case, the statistic t is ancillary with respect to $\psi$ if t, together with the 
marginal model for t, is nonformative with respect to $\psi$ (cf. Section 4.1). 

On the other hand, assume that the marginal distributions of t are para- 
metrized by $\psi$, so that (1) has the form 

(3) $p(x;\omega) = p(t;\psi)\,p(x;\omega|t)$. 

Then t is sufficient for x with respect to $\psi$ if x, together with the conditional 
model given t, is nonformative with respect to $\psi$. 

More generally, if t is ancillary or sufficient with respect to $\psi$, as above, it will 
also be called ancillary respectively sufficient with respect to any parameter 
function which depends on $\psi$ only. 

In connection with the discussion of S-ancillarity and S-sufficiency it is 
convenient to introduce the concept of a cut. Let t be a statistic. Each $P \in \mathfrak{P}$ may 
be broken into two pieces $P_t$ and $P^t$ (the marginal distribution of t and the 
conditional distribution given t), i.e. one has a mapping on $\mathfrak{P}$ into $\mathfrak{P}_t \times \mathfrak{P}^t$ 
given by $P \to (P_t, P^t)$. Now, t is said to be a cut if this mapping is actually onto 
$\mathfrak{P}_t \times \mathfrak{P}^t$ or, in other words, if any of the marginal distributions of t combined 
with any of the conditional distributions given t gives a probability measure in 
$\mathfrak{P}$. Clearly, if t is B-ancillary or B-sufficient then t is a cut. A cut which is neither 
B-ancillary nor B-sufficient is called proper. 

With $\mathfrak{P}$ being parametrized and dominated, the fact that a statistic t is a cut 
means that for a suitable parametrization of $\mathfrak{P}$ one has $\omega = (\chi, \psi)$ where $\chi$ and $\psi$ 
are variation-independent components of $\omega$, and 

(4) $p(x;\omega) = p(t;\chi)\,p(x;\psi|t)$. 

In relation to t, the components $\chi$ and $\psi$ are spoken of as a corresponding pair of 
L-independent parameters (cf. the definition of L-independence in Section 3.3). 

Now, the concepts of S-ancillarity and S-sufficiency are obtained from the 
general stipulations of ancillarity and sufficiency by invoking the definition of 
S-nonformation. One sees that t is S-ancillary or S-sufficient (with respect to 
some parameter function) if and only if t is a cut. In that case and with $\chi$ and $\psi$ 
as indicated by (4), t is S-ancillary with respect to $\psi$ (and any function thereof) and 
S-sufficient with respect to $\chi$ (and any function thereof). 

Examples 3.2-3.7 of L-independence are readily seen to be also instances 
of cuts, and several other instances appear in Sections 10.2 and 10.3. Suffice it 
therefore here to give just two further illustrations. 

Example 4.8. Poisson regression. In log-linear Poisson regression situations the 
total count is typically a cut. 

In the simplest example of such a situation the observation at the value t of the 
regression parameter is Poisson distributed with mean value 

$$e^{\alpha + \beta t}$$

and independent observations are actually made at 
$t_1 < t_2 < \cdots < t_n$. The probability of $(x_1, \dots, x_n)$ is 

$$\frac{1}{x_1! \cdots x_n!} \exp\Bigl\{-e^{\alpha} \sum_{j=1}^{n} e^{\beta t_j}\Bigr\} \exp\Bigl\{\alpha x_{\cdot} + \beta \sum_{j=1}^{n} t_j x_j\Bigr\}$$

and x. follows the Poisson distribution with parameter 

$$e^{\alpha} \sum_{j=1}^{n} e^{\beta t_j}$$

while conditionally on x. the variate $(x_1, \dots, x_n)$ has a (singular) multinomial 
distribution whose cell probabilities depend on $\beta$ only. Thus, if $\alpha$ and $\beta$ are 
variation independent and if $\alpha$ varies in R then x. is a cut. 

More generally, suppose m series of observations have been taken at 
$t_1 < \cdots < t_n$ and that the parameter $\alpha$ may be varying from series to series. The 
distribution of $x_{ij}$, the jth observation from the ith series, is then Poisson with 
mean value $\exp\{\alpha_i + \beta t_j\}$ and the joint distribution of the observations may be 
factored into the marginal distribution of x.. which is Poisson with mean value 

$$\sum_{i=1}^{m} \sum_{j=1}^{n} e^{\alpha_i + \beta t_j},$$

the conditional distribution of the series totals $(x_{1\cdot}, \dots, x_{m\cdot})$ given x.. which is 
multinomial with cell probabilities 

$$\Bigl(\sum_{i=1}^{m} e^{\alpha_i}\Bigr)^{-1} (e^{\alpha_1}, \dots, e^{\alpha_m}),$$

and, finally, the conditional distribution of the full set of observations given the 
series totals, this latter depending only on $\beta$. It follows that both x.. and the 
vector of series totals are cuts provided that $(\alpha_1, \dots, \alpha_m)$ and $\beta$ are variation 
independent and that $(\alpha_1, \dots, \alpha_m)$ has $R^m$ as domain of variation. This was noted and used 
by Kalbfleisch and Sprott (1974) to split the analysis of certain dilution series data 
from virological experiments into two separate parts. ► 
Example 4.9. Poisson process with log-linear trend. Suppose a Poisson process 
with intensity function 

$$\lambda(t) = e^{\alpha + \beta t}$$

has been observed in the time interval [0, T]. Let n be the number of events in this 
interval and let $t_1 < \cdots < t_n$ be the times at which they occurred. The 
distribution of n is Poisson with mean value 

$$e^{\alpha}\,(e^{\beta T} - 1)/\beta$$

and, given n, the vector $(t_1, \dots, t_n)$ has the density 

$$n!\,\beta^{n}\,(e^{\beta T} - 1)^{-n} \exp\Bigl\{\beta \sum_{i=1}^{n} t_i\Bigr\}$$

which, incidentally, is the density for the order statistic of a sample of n from the 
exponential distribution truncated to [0, T]. Consequently, for $(\alpha, \beta)$ varying in 
$R^2$, the number of events n is a cut, and hence, in particular, inference on $\beta$ should 
be performed conditionally on n, cf. Cox and Lewis (1966). ► 
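The factorization underlying Example 4.9 can be verified numerically. The sketch below rests on stated assumptions: the intensity is taken to be $\exp\{\alpha + \beta t\}$ with $\beta \neq 0$, the density of the observed event times is the standard Poisson process expression (product of intensities at the events times the exponential of minus the integrated intensity), and the function names are illustrative.

```python
from math import exp, factorial, isclose


def mean_count(alpha, beta, T):
    """Integrated intensity over [0, T] for the intensity exp(alpha + beta * t), beta != 0."""
    return exp(alpha) * (exp(beta * T) - 1.0) / beta


def full_density(times, alpha, beta, T):
    """Density of observing events exactly at the given ordered times in [0, T]."""
    n = len(times)
    return exp(alpha * n + beta * sum(times) - mean_count(alpha, beta, T))


def count_pmf(n, alpha, beta, T):
    """Poisson probability of n events in [0, T]."""
    m = mean_count(alpha, beta, T)
    return exp(-m) * m**n / factorial(n)


def conditional_density(times, beta, T):
    """Density of the ordered event times given their number; it involves beta only."""
    n = len(times)
    return factorial(n) * (beta / (exp(beta * T) - 1.0)) ** n * exp(beta * sum(times))


times, T = [0.3, 1.1, 1.7, 2.6], 3.0
for alpha, beta in [(0.2, 0.5), (-1.0, -0.8), (1.5, 0.1)]:
    lhs = full_density(times, alpha, beta, T)
    rhs = count_pmf(len(times), alpha, beta, T) * conditional_density(times, beta, T)
    assert isclose(lhs, rhs, rel_tol=1e-9)
```

The marginal factor depends on the parameters only through the mean count, which can take any positive value for each fixed $\beta$, while the conditional factor is free of $\alpha$.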

Next, a couple of general remarks concerning G-sufficiency will be made, and 
these will be followed by some examples of G-ancillarity and G-sufficiency, i.e. of 
factorizations as in (2) or (3) with the marginal model for t, respectively the 
conditional model given t, being G-nonformative. 
These examples may also be viewed as examples of M-ancillarity and 
M-sufficiency, cf. Section 4.3. Further illustrations of the latter two concepts will 
be mentioned subsequently. 

Suppose there exists a group G of transformations of $\mathfrak{X}$ and a statistic t such 
that 

(i) For each value of the interest parameter $\psi$ the corresponding family of 
probability measures is generated by G, and $\Omega$ is in one-to-one cor- 
respondence with $\Psi \times G$ (in the obvious manner). 

(ii) The orbits of G are the elements of the partition of $\mathfrak{X}$ induced by t. 

Note that, under (i) and (ii), $\psi$ and t are corresponding maximal invariants 
which implies that the distribution of t depends on $\psi$ only. It is thus clear that (i) 
and (ii) are nearly sufficient to ensure G-sufficiency. Furthermore, most examples 
of G-sufficiency satisfy (i) and (ii). (The properties (i) and (ii) are, in essence, the 
conditions required by Barnard (1963a) in his definition of ‘sufficiency of t with 
respect to $\psi$ in the absence of knowledge of g ($\in G$)’.) 

Ordinarily, the group G will be unitary, so that for each value of $\psi$ the 
corresponding subfamily of $\mathfrak{P}$ is a group family. 

Example 4.10. In the joint distribution of $\bar{x}$ and $s^2$ from a normal sample, with 
$(\xi, \sigma^2)$ unknown, the empirical variance $s^2$ is G-sufficient with respect to $\sigma^2$ and, 
moreover, $\bar{x}$ is G-ancillary with respect to $\sigma^2$. ► 
Example 4.11. Let $x_1$ and $x_2$ be independent random variables, $x_i$ following a 
gamma distribution with scale parameter $\beta_i$ and known shape parameter $\lambda_i$ 
(i = 1, 2). Thus $x_i$ has density 

$$\frac{x^{\lambda_i - 1}}{\Gamma(\lambda_i)\,\beta_i^{\lambda_i}}\, e^{-x/\beta_i} \qquad (x > 0).$$

Suppose $\chi = \beta_1/\beta_2$ is the parameter of interest and let $t = x_1/x_2$. The marginal 
distribution of t is the generalized F-distribution with density 

$$\frac{\Gamma(\lambda_{\cdot})}{\Gamma(\lambda_1)\Gamma(\lambda_2)}\, \frac{1}{\chi}\, \frac{(t/\chi)^{\lambda_1 - 1}}{(1 + t/\chi)^{\lambda_{\cdot}}}, \qquad \lambda_{\cdot} = \lambda_1 + \lambda_2.$$

Furthermore, the conditional distribution of $x_2$ given t is a gamma distribution 
with form parameter $\lambda_{\cdot}$ and scale parameter 

$$(1 + t/\chi)^{-1} \beta_2.$$

Hence if the domain of variation B of $(\beta_1, \beta_2)$ is such that $\chi$ and $\beta_2$ are variation 
independent with $\beta_2$ varying in $(0, \infty)$ (which is the case for $B = (0, \infty)^2$ or 
$B = \{0 < \beta_1 < \beta_2\}$) then t is G-sufficient with respect to $\chi$. ► 

Example 4.12. Consider the family of distributions of $(\bar{x}, \sum (x_i - \bar{x})'(x_i - \bar{x}))$ where 
$x_1, \dots, x_n$ is a sample of multidimensional normal variates. The distribution of the 
matrix r of empirical correlations depends only on the correlation matrix $\rho$, and r 
is G-sufficient for $(\bar{x}, \sum (x_i - \bar{x})'(x_i - \bar{x}))$ with respect to $\rho$. (Barnard 1966.) ► 

Example 4.13. Let $v_1, \dots, v_n$ be a sample from the k-dimensional von 
Mises-Fisher distribution with mean direction $\mu$ and precision $\kappa$ (cf. Example 
8.1). The resultant length $|v_{\cdot}|$ has a distribution which depends on the precision $\kappa$ 
only, and given $|v_{\cdot}|$ the distribution of the mean direction $v_{\cdot}/|v_{\cdot}|$ is the k-dimen- 
sional von Mises-Fisher distribution having mean direction $\mu$ and precision 
$|v_{\cdot}|\kappa$. Thus, for freely varying $(\mu, \kappa)$, the resultant length is G-sufficient with 
respect to $\kappa$ in the family of distributions of the (minimal B-sufficient) 
statistic $v_{\cdot}$. ► 

Example 4.14. A model for regression analysis of lifetime data is specified by 
assuming that the lifetime x of an individual with regressor covariate 
vector z follows a distribution with hazard function 

(5) $\lambda_0(x)\,e^{\beta \cdot z}, \qquad x \in (0, \infty),$ 

where $\beta$ is an unknown parameter vector and $\lambda_0(\cdot)$ is an unknown function which 
is not identically 0 over any open interval. For a fixed $\beta$ the family of distributions 
having hazard (5) is generated from the distribution with hazard $\exp\{\beta \cdot z\}$ by the 
group of time transformations $x \to g(x)$ where g is differentiable and strictly 
increasing. The connection between $g(\cdot)$ and $\lambda_0(\cdot)$ is given by 

$$g^{-1}(x) = \int_0^x \lambda_0(y)\,dy.$$

Suppose now that a sample of n lifetimes $x_i$ with covariates $z_1, \dots, z_n$ is 
observed. Let G be the group of transformations of $x = (x_1, \dots, x_n)$ of the form 

$$(x_1, \dots, x_n) \to (g(x_1), \dots, g(x_n))$$

with g as above. The rank statistic $t = ((1), \dots, (n))$, which gives the permutation of 
1, ..., n such that $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$, is G-sufficient with respect to $\beta$. 

The regression model discussed here was proposed by Cox (1972) who also 
indicated the possibility of drawing inference separately on $\beta$. That the evidence 
about $\beta$ may be isolated through G-sufficiency was shown by Kalbfleisch and 
Prentice (1973) who also derived the likelihood function for $\beta$ based on the 
marginal distribution of t. This likelihood function is 

$$\prod_{i=1}^{n} \frac{\exp\{\beta \cdot z_{(i)}\}}{\sum_{j=i}^{n} \exp\{\beta \cdot z_{(j)}\}},$$

where $z_{(i)}$ denotes the covariate of the individual with the ith smallest lifetime. 

A particular interest of the present example lies in the fact that $\lambda_0(\cdot)$, the 
incidental part of the specification when $\beta$ is the parameter of interest, is of 
nonparametric character. ► 
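The marginal likelihood of the rank statistic in Example 4.14 is the probability of the observed ordering of the lifetimes, so summed over all possible orderings it must equal 1. The following is a minimal sketch, assuming scalar covariates, no ties and no censoring; the function name and the covariate values are hypothetical.

```python
from itertools import permutations
from math import exp, isclose


def rank_likelihood(order, z, beta):
    """Probability that the individuals respond in the given order (order[0] first).

    The individual i has hazard lambda_0(x) * exp(beta * z[i]); the unknown
    baseline lambda_0 drops out of the probability of the ordering.
    """
    rates = [exp(beta * z[i]) for i in order]
    prob = 1.0
    for i in range(len(rates)):
        # Probability that order[i] responds first among those not yet responded.
        prob *= rates[i] / sum(rates[i:])
    return prob


z = [0.0, 0.5, -1.2, 2.0]   # hypothetical covariate values
beta = 0.8

# The probabilities of all 4! orderings add up to 1.
total = sum(rank_likelihood(order, z, beta) for order in permutations(range(len(z))))
assert isclose(total, 1.0, rel_tol=1e-9)

# For beta = 0 every ordering is equally likely.
assert isclose(rank_likelihood((2, 0, 3, 1), z, 0.0), 1.0 / 24.0, rel_tol=1e-9)
```

The function depends on the data only through the ordering and on the parameters only through $\beta$, in line with the G-sufficiency of the rank statistic.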

Example 4.15. Two diallelic, autosomal loci may carry the genes G and g 
respectively T and t. A double heterozygotic individual is chosen at random from 
a population in which a proportion $\lambda$ of all heterozygotes is of cis type (chromo- 
somal arrangement GT/gt) while the proportion $1 - \lambda$ is in trans (chromosomal 
arrangement Gt/gT). A cross between this individual and a double recessive ggtt 
yields n offspring which are classified according to genotype, giving the table 

GgTt  Ggtt  ggTt  ggtt 
  a     b     c     d 

For a recombination probability of $\pi$, this table has probability 

$$\frac{n!}{a!\,b!\,c!\,d!}\, 2^{-n} \bigl\{\lambda (1-\pi)^{a+d} \pi^{b+c} + (1-\lambda)\, \pi^{a+d} (1-\pi)^{b+c}\bigr\}.$$

Thus 

$$x = a + d$$

is B-sufficient and the probability of x is 

$$p(x;\lambda,\pi) = \binom{n}{x} \bigl\{\lambda (1-\pi)^{x} \pi^{n-x} + (1-\lambda)\, \pi^{x} (1-\pi)^{n-x}\bigr\}.$$
The statistic 

$$y = \min\{x, n - x\}$$

has distribution 

$$p(y;\pi) = \binom{n}{y} \bigl\{(1-\pi)^{y} \pi^{n-y} + \pi^{y} (1-\pi)^{n-y}\bigr\}$$

which does not involve $\lambda$. Furthermore, the conditional distribution of x given y is 
concentrated on the two points y and n - y, and 

$$p(x;\lambda,\pi|y) = \begin{cases} \dfrac{\lambda (1-\pi)^{y}\pi^{n-y} + (1-\lambda)\,\pi^{y}(1-\pi)^{n-y}}{(1-\pi)^{y}\pi^{n-y} + \pi^{y}(1-\pi)^{n-y}} & \text{for } x = y \\[2ex] \dfrac{(1-\lambda)(1-\pi)^{y}\pi^{n-y} + \lambda\,\pi^{y}(1-\pi)^{n-y}}{(1-\pi)^{y}\pi^{n-y} + \pi^{y}(1-\pi)^{n-y}} & \text{for } x = n - y. \end{cases}$$

Note that 

(6) $p(x;\lambda,\pi|y) = p(n - x; 1 - \lambda, \pi|y)$. 

Suppose $\lambda$ and $\pi$ are variation independent and that $\pi$ varies in (0, 1/2]. (In most 
situations of genetical interest, values of the recombination probability greater 
than 1/2 are not a realistic possibility. Moreover, by restricting $\pi$ to the interval (0, 1/2] 
one ensures that different values of $(\lambda, \pi)$ yield different distributions for x.) 
Denote the domain of variation for $\lambda$ by $\Lambda$. 

If $\Lambda = (0, 1)$ then, as follows immediately from (6) and the definition of 
G-nonformation, the statistic y is G-sufficient for x with respect to $\pi$. 

Next, consider the case $\Lambda = (0, 1/2)$. This corresponds to an experimental 
situation where the base population, from which the double heterozygote used for 
the test cross was sampled, has arisen from a pure trans population by 
reproduction over an unknown number of generations. ($\lambda$ increases from 0 to 1/2 as 
the number of generations increases to infinity.) Here, G-nonformation does not 
hold, but y is still M-sufficient with respect to $\pi$, since 

$$p(x;\lambda,\pi|y) \to \tfrac{1}{2} \quad \text{for } \lambda \uparrow \tfrac{1}{2}.$$

Finally, suppose $\lambda$ is known to be very small, such as would be the case if the 
base population is an originally pure trans type population which has, acciden- 
tally, been contaminated with a few cis individuals. Then y is not M-sufficient, and 
the conditional model given y may in fact yield strong evidence concerning $\pi$. To see 
this, note that if x > n/2 then x = n - y and, for $\lambda$ close to 0, the conditional 
probability of the observed x is approximately 

$$\frac{(1-\pi)^{y}\pi^{n-y}}{(1-\pi)^{y}\pi^{n-y} + \pi^{y}(1-\pi)^{n-y}},$$

which is small when $\pi$ is small. 

Thus, if the observed value of x is greater than n/2 then conditionally on y small 
values of $\pi$ are strongly counter-indicated. ► 
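The main computational claims of Example 4.15 can be checked with a short script. This is a sketch under the assumption that x has the two-component binomial mixture distribution with cis proportion $\lambda$ (`lam`) and recombination probability $\pi$ (`pi`); the function names are illustrative.

```python
from math import comb, isclose


def p_x(x, n, lam, pi):
    """Probability of x: a mixture of two binomial terms with weights lam and 1 - lam."""
    return comb(n, x) * (lam * (1 - pi) ** x * pi ** (n - x)
                         + (1 - lam) * pi**x * (1 - pi) ** (n - x))


def p_y(y, n, lam, pi):
    """Probability of y = min(x, n - x), summing over the (at most two) values of x."""
    if 2 * y == n:
        return p_x(y, n, lam, pi)
    return p_x(y, n, lam, pi) + p_x(n - y, n, lam, pi)


def p_x_given_y(x, n, lam, pi):
    """Conditional probability of x given y = min(x, n - x)."""
    return p_x(x, n, lam, pi) / p_y(min(x, n - x), n, lam, pi)


n = 7
# The distribution of y does not involve lam.
assert isclose(p_y(2, n, 0.1, 0.3), p_y(2, n, 0.9, 0.3), rel_tol=1e-9)
# The symmetry relation between x and n - x when lam is replaced by 1 - lam.
assert isclose(p_x_given_y(5, n, 0.2, 0.3), p_x_given_y(2, n, 0.8, 0.3), rel_tol=1e-9)
# For small lam and x > n/2, small values of pi give a small conditional probability.
assert p_x_given_y(5, n, 0.01, 0.05) < p_x_given_y(5, n, 0.01, 0.45)
```

The last assertion mirrors the closing remark of the example: with $\lambda$ known to be small, the conditional model given y does carry evidence about $\pi$.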
A number of examples of M-ancillarity in exponential families will be exhibited 
in Section 10.5. Furthermore, all the instances of S-ancillary statistics mentioned 
in the foregoing are also M-ancillary, as is simple to see. Therefore just one other 
case of M-ancillarity is presented here. 

Example 4.16. Let $x_1, \dots, x_n$ be independent and identically distributed, accord- 
ing to the negative binomial distribution which has point probabilities 

i = 0,1,2,.... 

Let the domain of variation for the parameter $(\chi, \pi)$ be $(0, \infty) \times (0, 1)$. 

The family of conditional distributions of $x = (x_1, \dots, x_n)$ given x. is para- 
metrized by $\chi$ alone, and x. follows the negative binomial distribution with 
parameter $(n\chi, \pi)$. For any fixed $\chi$ the class of distributions of x. is universal and 
hence x. is M-ancillary with respect to $\chi$. ► 

Finally, after these exemplifications of S-, G-, and M-ancillarity and 
-sufficiency, a few, more general, points will be taken up. 

As mentioned in Section 4.1, there is in general no guarantee that different 
processes of separation, based on the various definitions of nonformation, will 
lead ultimately to the same submodel, to be used as the framework for inference 
on the interest parameter $\psi$. However, certain kinds of uniqueness hold under 
regularity conditions. 

Suppose that t and u are statistics such that for each fixed value of $\psi$ one has 
that t is B-ancillary and u is B-sufficient. Thus 

(7) $p(x;\omega) = p(t;\psi)\,p(x;\omega|t) = p(u;\omega)\,p(x;\psi|u)$. 

Let it moreover be assumed that $\psi$ parametrizes both the marginal distributions 
for t and the conditional distributions given u. If it is known or can be proved that 

(a) x is a one-to-one function of (t, u), 

(b) for each fixed value of $\psi$ the family of marginal distributions for u is 
boundedly complete, 

then, by (5) and Corollary 4.1, t and u are independent and hence the marginal 
model for t and the conditional model given u are, in fact, the same. As is simple to 
see, (a) and (b) are satisfied, in particular, if t is S-sufficient, u is S-ancillary, and $\mathfrak{P}$ 
is boundedly complete (the uniqueness conclusion in this case was, in essence, 
given by Dawid (1975)). Assuming (b), condition (a) also holds if x is minimal 
sufficient, $\mathfrak{P}$ is discrete and t is M-sufficient. To see this, note that, since t and u are 
independent, 

$$p(x;\omega) = p(t;\psi)\,p(u;\omega)\,p(x;\omega|t,u) = p(u;\omega)\,p(x;\psi|u)$$

and hence $p(x;\omega|t,u)$ depends on $\omega$ through $\psi$ only. The assumption of 
M-sufficiency then entails, on account of Corollary 2.1, that the conditional 
distribution of x given (t, u) is uniform and since x is minimal sufficient this is 
possible only if (t, u) stands in one-to-one correspondence with x. 

Another type of uniqueness is ensured if a maximal ancillary or minimal 
sufficient statistic exists (see the end of the next section). 


4.5 QUASI-ANCILLARITY AND QUASI-SUFFICIENCY 

Two new notions termed quasi-ancillarity and quasi-sufficiency are introduced 
here, mainly because of their technical usefulness. These notions have a 
theoretical status between the general ideas of ancillarity and sufficiency, 
discussed in Section 4.1, and the specific ancillarity and sufficiency concepts 
designated by B-, S-, G-, and M-. 

The results of this section will be used in Section 10.1. 

Recall, from Section 4.2, that a statistic u is said to be conditionally B-ancillary 
given a statistic t if the conditional distribution of u given t does not depend on 
$\omega$. 

For any statistic t, let $\psi_t$ be a parameter function which induces the same 
partition of $\mathfrak{P}$ as does the mapping $P \to P_t$, i.e. $\psi_t$ parametrizes the distributions of 
t. Then t is called quasi-sufficient with respect to $\psi_t$ (and any function thereof) if 
any statistic u, such that t is a function of u and such that $\psi_u$ induces the same 
partition of $\mathfrak{P}$ as $\psi_t$, is conditionally B-ancillary given t. 

Similarly, let $\psi^t$ denote a parameter function inducing the same partition of $\mathfrak{P}$ 
as the mapping $P \to P^t$. The statistic t is quasi-ancillary with respect to $\psi^t$ (and any 
function thereof) if for any statistic u, such that $\psi^u$ and $\psi^t$ induce the same 
partition of $\mathfrak{P}$ and such that u is a function of t, one has that t is conditionally 
B-ancillary given u. 

The content of the definition of quasi-ancillarity is this. Let t and u be statistics 
with u being a function of t, let $\psi$ be a parameter function and suppose the three 
mappings $P \to P^t$, $P \to P^u$, and $\psi$ all induce the same partition of $\mathfrak{P}$. Then $p(x;\omega)$ 
factorizes as 

(1) $p(x;\omega) = p(u;\omega)\,p(t;\psi|u)\,p(x;\psi|t)$. 

If the conditional distribution of t given u depends effectively on $\psi$ then the 
conditional distribution of x given t cannot be said to contain all the available 
information on $\psi$, i.e. t cannot be considered ancillary with respect to $\psi$. Thus 
quasi-ancillarity is a necessary condition for ancillarity. 

Example 4.17. If Xi and X 2 are independent Poisson variates having mean values 
Xi and and if the domain of variation of is given by X 2 ^^1 

(cf. Example 3.4) then $x_2$ can, obviously, not be considered as ancillary with 
respect to $\lambda_1$. However, as will be shown in Section 10.1, $x_2$ is quasi-ancillary with 
respect to $\lambda_1$. Quasi-ancillarity is therefore not a sufficient condition for 
ancillarity. ► 
The raison d'être for quasi-sufficiency is similar to that for quasi-ancillarity. 

It will now be shown that both S-ancillarity and M-ancillarity, at least for 
discrete type families $\mathfrak{P}$, imply quasi-ancillarity. Let t, u, and $\psi$ be as in the 
paragraph preceding Example 4.17. If t is S-ancillary then it is a cut and hence the 
class of distributions of t for fixed $\psi$ does not depend on $\psi$. From (1) one sees that 
the conditional distribution of t given u cannot, therefore, depend on $\psi$, whence t is 
quasi-ancillary. Next, suppose t is M-ancillary. If there exists a factorization (1) 
such that $p(t;\omega) = p(u;\omega)\,p(t;\psi|u)$ is the density considered in checking the 
universality condition then, by the remark following Corollary 2.1, $p(t;\psi|u)$ does 
not depend on $\psi$ (and t) which implies that t is conditionally B-ancillary given u. 
Thus, if every one of the possible statistics u allows a factorization of the kind 
mentioned, which is certainly the case if $\mathfrak{P}$ is of discrete type, then t is quasi- 
ancillary. 

It is natural to consider the following general definitions of minimal sufficiency 
and maximal ancillarity. Let the symbol □ stand for either B, S, G, M, or quasi. 
The statistic t is minimal □-sufficient w.r.t. the parameter function $\psi$ if t is 
□-sufficient w.r.t. $\psi$ and if for any other statistic u, which is □-sufficient w.r.t. $\psi$, 
one has $\sigma(t) \subset \sigma(u) \vee 0$. (Here, as in Section 4.2, 0 is the class of sets which have 
P-measure 0 for every $P \in \mathfrak{P}$.) Also, t is maximal □-ancillary w.r.t. $\psi$ if t is 
□-ancillary w.r.t. $\psi$ and if for any other statistic u, which is □-ancillary w.r.t. $\psi$, 
one has $\sigma(u) \subset \sigma(t) \vee 0$. Minimal B-sufficiency and maximal B-ancillarity have 
already been discussed in Section 4.2. As was mentioned there, maximal 
B-ancillary statistics only rarely exist. However, for exponential families it is 
possible to establish a sufficient condition for maximal quasi-ancillarity (see 
Theorem 10.2) and, according to this condition, most of the S- and M-ancillary 
statistics mentioned in the examples of Section 4.4 and in Chapter 10 are 
maximal. 


4.6 CONDITIONAL AND UNCONDITIONAL PLAUSIBILITY 
FUNCTIONS 

Suppose again that the probability function for x factorizes as 

$$p(x;\omega) = p(t;\omega)\,p(x;\psi|t).$$

In this section the relation between the unconditional and the conditional 
plausibility functions $\Pi(\omega;x)$ and $\Pi(\psi;x|t)$, in particular between the uncon- 
ditional and conditional maximum plausibility estimates of $\psi$, will be discussed; 
those estimates are denoted, respectively, by $\check\psi(x) = \psi(\check\omega(x))$ and $\check\psi(x|t)$. (A 
parallel discussion of the corresponding likelihood quantities would be possible 
but trivial since it would, essentially, require as an assumption that t is a cut.) 

It is not presupposed in what follows that t is (M-) ancillary (but the conditions 
in Theorem 4.7 below come close to implying M-ancillarity). 
Let x be a mode point of the family p of probability functions for x. If $\check\omega(x)$ is 
non-empty and $\check\omega \in \check\omega(x)$ then 

$$p(x;\check\omega) \ge p(\tilde{x};\check\omega) \quad \text{for all } \tilde{x} \in \mathfrak{X},$$

whence, for $\check\psi = \psi(\check\omega)$ and t = t(x), 

(1) $p(x;\check\psi|t) \ge p(\tilde{x};\check\psi|t)$ for all $\tilde{x}$ with $t(\tilde{x}) = t$. 

Therefore, if x is a mode point for p then 

(2) $\check\psi(x) \subset \check\psi(x|t)$. 

Without the assumption of x being a mode point, (2) is not, in general, true as is 
simple to see by example. Note moreover that if both sides of (2) are one-point 
sets, which is often the case with continuous type distributions, then the 
unconditional and conditional estimates are simply equal (in contrast to what 
usually occurs for maximum likelihood estimation). In Theorems 4.6 and 4.7 
conditions will be given which are sufficient to ensure equality in (2), whatever 
the type of distribution. 

Theorem 4.6. Let x be a mode point of p and suppose that for all values $\psi$ there exists 
an $\omega$ such that $\psi = \psi(\omega)$ and $t \in t(\check{x}(\omega))$ (where t = t(x)). 

Then 

$$\check\psi(x) = \check\psi(x|t)$$

and 

(3) $\sup_{\omega|\psi} \Pi(\omega;x) = \Pi(\psi;x|t), \quad \psi \in \Psi.$ 

(Since x is a mode point, $\bar\Pi(\omega;x) = \Pi(\omega;x)$ and $\bar\Pi(\psi;x|t) = \Pi(\psi;x|t)$.) 

Proof. It was shown above that $\check\psi(x) \subset \check\psi(x|t)$ on account of x being a mode point. 
Thus, suppose $\check\psi \in \check\psi(x|t)$. By Theorem 2.1, x is also a mode point for the 
conditional model given t and hence (1) holds. This allows the conclusion 

$$p(x;\omega) \ge p(\tilde{x};\omega) \quad \text{for all } \tilde{x}, \omega \text{ with } t(\tilde{x}) = t \text{ and } \psi(\omega) = \check\psi.$$

Now, choosing $\omega$ as indicated by the second assumption of the theorem and 
letting $\tilde{x}$ denote a point in $\check{x}(\omega)$ for which $t(\tilde{x}) = t$, one obtains 

$$p(x;\omega) \ge p(\tilde{x};\omega) = \sup_{x} p(x;\omega),$$

which shows that $\omega \in \check\omega(x)$, and hence $\check\psi \in \check\psi(x)$. 

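The plausibility function for $\omega$ may be rewritten as follows 

$$\Pi(\omega;x) = \frac{p(t;\omega)\,p(x;\psi|t)}{\sup_{x} p(x;\omega)} = \frac{p(x;\psi|t)}{\sup_{x|t} p(x;\psi|t)} \cdot \frac{\sup_{x|t} p(x;\omega)}{\sup_{x} p(x;\omega)} = \Pi(\psi;x|t)\,\frac{\sup_{x|t} p(x;\omega)}{\sup_{x} p(x;\omega)},$$

where $\sup_{x|t}$ denotes the supremum over those x which have the given value of t. 
Thus, in order to verify (3) it must be shown that 

$$\sup_{\omega|\psi} \frac{\sup_{x|t} p(x;\omega)}{\sup_{x} p(x;\omega)} = 1,$$

but this equality is implied by the assumption made in the theorem. ► 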
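The decomposition of the plausibility function used in the proof can be illustrated on a concrete model. The sketch below is an assumption-laden example and not part of the text: it takes two independent Poisson counts with means `lam1` and `lam2`, the statistic t = x1 + x2 and $\psi$ = lam1/(lam1 + lam2), so that the conditional distribution of x1 given t is binomial with parameters t and $\psi$. All function names are illustrative.

```python
from math import comb, exp, factorial, isclose


def poisson_pmf(k, mean):
    return exp(-mean) * mean**k / factorial(k)


def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def joint(x1, x2, lam1, lam2):
    """Probability function of (x1, x2) for independent Poisson counts."""
    return poisson_pmf(x1, lam1) * poisson_pmf(x2, lam2)


def sup_joint(lam1, lam2):
    """Supremum over all x: floor(mean) is a mode of each Poisson factor."""
    return poisson_pmf(int(lam1), lam1) * poisson_pmf(int(lam2), lam2)


def sup_joint_given_t(t, lam1, lam2):
    """Supremum over those x with x1 + x2 = t."""
    return max(joint(k, t - k, lam1, lam2) for k in range(t + 1))


def plausibility(x1, x2, lam1, lam2):
    """Unconditional plausibility: probability of x relative to the largest probability."""
    return joint(x1, x2, lam1, lam2) / sup_joint(lam1, lam2)


def conditional_plausibility(x1, t, psi):
    """Conditional plausibility in the binomial model for x1 given t."""
    top = max(binom_pmf(k, t, psi) for k in range(t + 1))
    return binom_pmf(x1, t, psi) / top


def decomposition(x1, x2, lam1, lam2):
    """Conditional plausibility times the ratio of the restricted to the full supremum."""
    t, psi = x1 + x2, lam1 / (lam1 + lam2)
    ratio = sup_joint_given_t(t, lam1, lam2) / sup_joint(lam1, lam2)
    return conditional_plausibility(x1, t, psi) * ratio


# The two sides of the decomposition agree.
assert isclose(plausibility(4, 1, 2.5, 1.5), decomposition(4, 1, 2.5, 1.5), rel_tol=1e-9)
# The ratio equals 1 when t is the value of the statistic at a mode point, here (2, 1).
assert isclose(sup_joint_given_t(3, 2.5, 1.5), sup_joint(2.5, 1.5), rel_tol=1e-12)
```

The second assertion is the situation covered by the assumption of Theorem 4.6: for such a parameter value the ratio of suprema attains its upper bound 1.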

It may be noted that if the assumption in Theorem 4.6 is fulfilled for every mode 
point of p then $\psi$ and $\check{t}$ are variation independent (where $\check{t}(\omega) = t(\check{x}(\omega))$). On the 
other hand, if $\psi$ and $\check{t}$ are variation independent then Theorem 4.6 applies to every 
mode point x of p such that $t \in \check{t}(\Omega)$. Further, see Theorem 4.7 below. 

Let $\check{x}(\psi|t)$ denote the set of modes of the conditional distribution $p(\cdot;\psi|t)$. 

Theorem 4.7. Assume that $p$ is universal. Then the two conditions

(i) $\check{\psi}(x) = \check{\psi}(x|t)$ for all $x$

(ii) $\{x : x \in \check{x}(\omega),\ \psi(\omega) = \psi,\ t(x) = t\} = \check{x}(\psi|t)$ for all values of $\psi$ and $t$

are equivalent. Moreover, (i) and (ii) and also

(iii) $\sup_{\omega|\psi} \Pi(\omega; x) = \Pi(\psi; x|t)$ for all values of $\psi$ and $x$

are implied by either of the following two conditions

(iv) $\psi$ and $\check{t}$ are variation independent, and $\check{t}(\Omega) = t(\mathfrak{X})$

(v) $t$ and $\check{\psi}$ are variation independent, and $\check{\psi}(\mathfrak{X}) = \Psi$,

and (iv) and (v) are equivalent.

The four conditions (i), (ii), (iv), and (v) are equivalent provided the range of $\check{\psi}(\cdot|t)$ is
$\Psi$ for all values of $t$.

Proof. By the universality of $p$ one has $x \in \check{x}(\omega) \Leftrightarrow \omega \in \check{\omega}(x)$ and (cf. Corollary 2.1)
$x \in \check{x}(\psi|t) \Leftrightarrow \psi \in \check{\psi}(x|t)$, from which the equivalence of (i) and (ii) as well as the
equivalence of (iv) and (v) follows simply. Conditions (i) and (iii) are consequences
of (iv), on account of Theorem 4.6. Finally, if the range of $\check{\psi}(\cdot|t)$ equals $\Psi$ then (i)
implies (v). ►

If $p$ is strictly universal then the relation $\check{t}(\Omega) = t(\mathfrak{X})$ in condition (iv) is
automatically true. Note also that the precondition that $\check{\psi}(\cdot|t)$ has range $\Psi$ is very
weak; it is, in particular, satisfied if $\mathfrak{X}_t = \{x : t(x) = t\}$ is finite, cf. Section 2.2.

Example 4.18. Suppose $x = (x_1, \ldots, x_m)$ where $x_1, \ldots, x_m$ are independent Poisson
variates having mean values $\lambda_1, \ldots, \lambda_m$, and let $t = x_\cdot$ and $\psi = (\lambda_1/\lambda_\cdot, \ldots, \lambda_m/\lambda_\cdot)$.
To show that $\psi$ and $\check{t}$ are variation independent, let $x_\cdot$ and $\psi$ be given. For any
$\mu > 0$ let $\lambda_i(\mu) = \mu\psi_i$. Then $\lambda_i(\mu)$ is a nondecreasing function
of $\mu$ for all $i$ and hence it is possible to choose a $\mu$ such that
$[\lambda_1(\mu)\} + \cdots + [\lambda_m(\mu)\} = x_\cdot$, where the symbol $[\ \}$ carries the meaning given in Section 1.1.
Since, with $\lambda_*(\mu)$ as the parameter value, the set of mode points for the distribution
of $x$ is determined by $([\lambda_1(\mu)\}, \ldots, [\lambda_m(\mu)\})$ one has that $x_\cdot \in \check{t}(\lambda_*(\mu))$. Moreover, by
construction, $\psi(\lambda_*(\mu)) = \psi$. This establishes the variation independence of $\psi$ and $\check{t}$.
In fact, all the conditions of Theorem 4.7 are fulfilled.

A simple derivation of Finucan's characterization of the modes of a multinomial
distribution, mentioned in Section 2.3(i), is now possible. On account of
Theorem 4.7(ii), the set of modes of the multinomial distribution having
trial parameter $x_\cdot$ and probability vector $\psi$ may be expressed as

+ ■•• + [Am} = 

or, equivalently, 

0, [fiij/i} + ■ •• + {jxif/Jj: = x}, 

which is Finucan’s result. ^ 
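
The characterization can be checked numerically in small cases. The Python sketch below is illustrative only and rests on one assumption about notation: that $[\lambda\}$ stands for the set of modes of a Poisson distribution with mean $\lambda$, i.e. the integers $x$ with $x \le \lambda \le x + 1$. Under that reading, a point $x$ with $x_1 + \cdots + x_m = x_\cdot$ is a multinomial mode when some $\mu > 0$ satisfies $x_i \le \mu\psi_i \le x_i + 1$ for all $i$. All function names are hypothetical. Exact rational arithmetic is used so that ties between modes are detected reliably.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def lattice_points(n, m):
    """All m-tuples of nonnegative integers summing to n."""
    return [x for x in product(range(n + 1), repeat=m) if sum(x) == n]


def multinomial_weight(x, psi):
    """Exact value of prod(psi_i**x_i / x_i!), proportional to the multinomial probability."""
    w = Fraction(1)
    for xi, pi in zip(x, psi):
        w *= pi ** xi / factorial(xi)
    return w


def brute_force_modes(n, psi):
    """Modes found by comparing all probabilities directly."""
    weights = {x: multinomial_weight(x, psi) for x in lattice_points(n, len(psi))}
    top = max(weights.values())
    return {x for x, w in weights.items() if w == top}


def criterion_modes(n, psi):
    """Points x with sum n such that x_i <= mu*psi_i <= x_i + 1 for some mu > 0."""
    modes = set()
    for x in lattice_points(n, len(psi)):
        lower = max(Fraction(xi) / pi for xi, pi in zip(x, psi))
        upper = min(Fraction(xi + 1) / pi for xi, pi in zip(x, psi))
        if lower <= upper:  # a common mu exists; upper > 0, so mu can be taken positive
            modes.add(x)
    return modes


psi = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))  # hypothetical probability vector
print(brute_force_modes(5, psi) == criterion_modes(5, psi))
```

The existence of a common $\mu$ reduces to comparing the largest lower bound $x_i/\psi_i$ with the smallest upper bound $(x_i + 1)/\psi_i$, which is why no search over $\mu$ is needed.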

Example 4.19. Let $x = x_{**}$ be an $r \times c$ contingency table of independent Poisson
variates, let $t$ be the set of marginals $(x_{*\cdot}, x_{\cdot *})$ and let $\psi = \psi_{**}$ be the interaction
parameter whose $(i,j)$th element is $\psi_{ij} = \theta_{ij} - \theta_{i\cdot} - \theta_{\cdot j} + \theta_{\cdot\cdot}$ where $\theta_{ij} = \ln \lambda_{ij}$ and
$\lambda_{ij}$ is the mean value of $x_{ij}$.

Again, all the conditions of Theorem 4.7, and hence of Theorem 4.6, are
satisfied. For $r$ or $c$ equal to 2 it is possible to show this in a simple way by the
kind of argument used in Example 4.18. A proof for general $r$ and $c$ has been
established by Jensen (1976). The difficult part of the proof consists in showing
that boundary points of $\check{\psi}(x)$ are also boundary points of $\check{\psi}(x|t)$, which, in
view of (2), is essentially what is needed to verify condition (i) of Theorem 4.7.
Note that since, by (2), 


and, since, as just mentioned, the two uttermost estimates are equal, all the
inclusions in (4) must be equalities, i.e. the maximum plausibility estimate of the
interaction is the same whether one conditions on none, some, or both marginals.
The estimate is given by

$$\check{\psi} = \psi\Bigl(\prod_{i,j} [x_{ij},\, x_{ij} + 1]\Bigr).$$

In the simplest case, $r = c = 2$, this estimate is

$$\left[\,\ln \frac{x_{11}x_{22}}{(x_{12} + 1)(x_{21} + 1)},\ \ln \frac{(x_{11} + 1)(x_{22} + 1)}{x_{12}x_{21}}\,\right].$$
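
As a small numerical illustration of the interval for $r = c = 2$, the following Python sketch evaluates its two endpoints for a given table. The function name is hypothetical, and the treatment of zero counts (an endpoint of minus or plus infinity) is an assumption made here for convenience rather than something stated in the text.

```python
import math


def max_plausibility_log_odds_interval(x11, x12, x21, x22):
    """Endpoints of the interval for the 2 x 2 interaction displayed above.

    Zero counts push the corresponding endpoint to minus or plus infinity
    (an assumption of this sketch).
    """
    lower_num = x11 * x22
    upper_den = x12 * x21
    lower = (math.log(lower_num / ((x12 + 1) * (x21 + 1)))
             if lower_num > 0 else -math.inf)
    upper = (math.log((x11 + 1) * (x22 + 1) / upper_den)
             if upper_den > 0 else math.inf)
    return lower, upper


# Hypothetical table with cells x11 = 3, x12 = 1, x21 = 2, x22 = 4.
low, high = max_plausibility_log_odds_interval(3, 1, 2, 4)
print(low, high)
```

For this table the endpoints are $\ln(12/6) = \ln 2$ and $\ln(20/2) = \ln 10$, so the interval contains the ordinary log odds ratio $\ln 6$.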


An analogue, for $r \times c$ tables, of Finucan's result is derivable as in the previous
example (see Jensen 1976). This yields, in particular, a description of the mode
points for the multivariate hypergeometric distribution (which occurs for
$\psi = 0$).

The paper by Jensen (1976) also contains a discussion of (conditional)
plausibility inference for contingency tables of arbitrary dimensions. In particular,
the usefulness of (3) is illustrated in that work. ►

For strongly unimodal, exponential families of continuous type the conditions 
of Theorem 4.7 are usually met (see Section 9.6). 


4.7 COMPLEMENTS 

(i) If, for an observed $x$ and a parameter function $\psi$, the likelihood function
$L(\omega) = p(x; \omega)$ has the property that there exists a value $\hat{\psi}$ of $\psi$ such that for any $\omega$
it is possible to find an $\tilde{\omega}$ with $\psi(\tilde{\omega}) = \hat{\psi}$ and $L(\tilde{\omega}) \ge L(\omega)$ then it seems
reasonable to speak of $\hat{\psi}$ as a maximum likelihood estimate of $\psi$ even though a
maximum likelihood estimate of $\omega$ itself may not exist. Similarly for plausibility
functions and other ods functions.

Example 4.20. For a single observation from the normal distribution, where
$\omega = (\xi, \sigma^2)$, the likelihood function has no maximum but $x$ is a maximum
likelihood estimate of $\xi$ in the above sense. ►
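
The two halves of Example 4.20 can be seen numerically. The sketch below (Python, with a hypothetical observation and hypothetical function name) shows that the likelihood at $\xi = x$ grows without bound as $\sigma^2$ shrinks, and that for any $(\xi, \sigma^2)$ replacing $\xi$ by $x$ does not decrease the likelihood.

```python
import math


def normal_likelihood(x, xi, sigma2):
    """Density of N(xi, sigma2) at the single observation x."""
    return math.exp(-(x - xi) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


x = 1.3  # hypothetical observation

# With xi fixed at x the likelihood increases without bound as sigma2 -> 0,
# so no maximum over (xi, sigma2) exists.
values_at_x = [normal_likelihood(x, x, s2) for s2 in (1.0, 1e-2, 1e-4, 1e-6)]

# For any (xi, sigma2), moving xi to x does not decrease the likelihood.
grid = [(xi, s2) for xi in (-2.0, 0.0, 1.3, 5.0) for s2 in (0.1, 1.0, 10.0)]
dominated = all(
    normal_likelihood(x, x, s2) >= normal_likelihood(x, xi, s2) for xi, s2 in grid
)
print(values_at_x, dominated)
```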

(ii) Model control. A rather often advocated way of controlling a proposed model
$\mathfrak{P}$ consists in seeking out a statistic u, say, which is conditionally B-ancillary
under the model and investigating whether it is tenable to consider u as following
the exactly known, possibly conditional, distribution which this statistic has,
according to the model and the conditional B-ancillarity.

Thus, for instance, a specification of a location-scale model for a sample
$x_1, \ldots, x_n$ may be controlled by testing that the B-ancillary statistic

(1) ~^n \ 

\Xi - X2’"*’Xi - X2/ 

has the parameter-free distribution prescribed by the model. 

The above procedure raises a number of interesting questions, such as to what 
extent the procedure is, in any given case, exhaustive and specific for the model 
control problem. Without attempting anything like a comprehensive discussion 
of those questions, a couple of points will be mentioned here. 

Suppose u is B-ancillary. It is then pertinent to ask whether x together with the
conditional model given u contains accessible evidence with respect to the
question of validity of the model. Similarly, if u = x and the considered
distribution of x is that conditional on a (minimal) B-sufficient statistic t, one may
inquire as to the possible controllability of the model on the basis of t alone.

A model which is not universal will in general be controllable. For the two
situations just considered it is therefore essential to inquire whether the
conditional model given u, respectively the marginal model for t, is universal. In
many cases the answer is affirmative. The location-scale example with u given by
(1) is among these.

Concerning the specificity of the model control it is of some interest to know 
whether the probability measures of the given model are the only ones which, 
under some mild general regularity conditions, assign to u the (conditional) 
distribution at hand, i.e. whether the distribution of u is characteristic for the
model. In a number of instances that is indeed the case, cf. the examples and
references given below. It should however be kept in mind that such a 
characterization result may say very little of how sharp a check of the model the 
control based on the parameter-free distribution of u does yield. 

Example 4.21. If $x_1, \ldots, x_n$ are independent and identically normally distributed
then 


s s / 

(where $\bar{x} = x_\cdot/n$ and $s^2 = \sum (x_i - \bar{x})^2/(n - 1)$) follows the uniform distribution on
the hypersphere in $R^n$ given by $\sum u_i = 0$, $\sum u_i^2 = 1$. The converse is true under
weak regularity conditions (see Zinger and Linnik 1964). ►

Example 4.22. Suppose $x_1, \ldots, x_n$ with $n > 2$ are independent, identically distributed,
and positive random variates of continuous type and let

$$y_i = \sum_{j=1}^{i} (x_j/x_\cdot), \qquad i = 1, \ldots, n - 1$$

and

$$u = -\sum_{i=1}^{n-1} 2 \ln y_i.$$

Then $u$ has $\chi^2$-distribution with $2(n - 1)$ degrees of freedom if and only if the $x_i$ are
exponentially distributed (Csörgő and Seshadri 1970).
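
The "if" direction of Example 4.22 lends itself to a quick simulation check. The Python sketch below is only illustrative: the function names, the seed, and the sample sizes are hypothetical choices, and a matching mean is of course a necessary consequence of the $\chi^2$ law rather than a proof of it. For exponential samples the average of $u$ should be close to $2(n - 1)$, the mean of a $\chi^2$ variate with $2(n - 1)$ degrees of freedom.

```python
import math
import random


def csorgo_seshadri_statistic(xs):
    """u = -2 * sum_{i=1}^{n-1} ln y_i with y_i = (x_1 + ... + x_i) / x_."""
    total = sum(xs)
    partial = 0.0
    u = 0.0
    for xj in xs[:-1]:          # i runs over 1, ..., n - 1
        partial += xj
        u -= 2.0 * math.log(partial / total)
    return u


def simulate_mean_u(n, replications, seed=0):
    """Average of u over simulated exponential samples of size n."""
    rng = random.Random(seed)
    draws = [
        csorgo_seshadri_statistic([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replications)
    ]
    return sum(draws) / len(draws)


print(simulate_mean_u(n=5, replications=20000))  # compare with 2 * (5 - 1) = 8
```

Because each $y_i$ is a ratio of sums of the $x_j$, the statistic does not depend on the scale of the exponential distribution, so the rate 1.0 used here is immaterial.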

Example 4.23. In a 2 × 2 contingency table

$$\begin{array}{cc|c}
x_1 & n_1 - x_1 & n_1 \\
x_2 & n_2 - x_2 & n_2 \\ \hline
x_\cdot & n_\cdot - x_\cdot & n_\cdot
\end{array}$$

(with $x_1$ and $x_2$ independent) the conditional distribution of $x_1$ given $x_\cdot$ is
hypergeometric if and only if $x_1$ and $x_2$ are both binomially distributed with a
common probability parameter. This is a consequence of a theorem due to Patil
and Seshadri (1964) and Menon (1966). ►
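
The "if" half of this statement is easy to verify exactly. The Python sketch below (function names and the particular values of $n_1$, $n_2$, $x_\cdot$ and the common probability are hypothetical) computes the conditional distribution of $x_1$ given $x_\cdot$ for two independent binomials with a common probability parameter and compares it with the hypergeometric probabilities. It does not address the converse, which is the substantive part of the cited theorem.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(x, n, p):
    """Binomial probability; exact when p is a Fraction."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def conditional_pmf_x1(x1, n1, n2, total, p):
    """P(x1 | x1 + x2 = total) for independent binomials with common p."""
    joint = binomial_pmf(x1, n1, p) * binomial_pmf(total - x1, n2, p)
    marginal = sum(
        binomial_pmf(k, n1, p) * binomial_pmf(total - k, n2, p)
        for k in range(max(0, total - n2), min(n1, total) + 1)
    )
    return joint / marginal


def hypergeometric_pmf(x1, n1, n2, total):
    """Hypergeometric probability of x1 successes in the first row."""
    return Fraction(comb(n1, x1) * comb(n2, total - x1), comb(n1 + n2, total))


p = Fraction(1, 3)  # hypothetical common probability parameter
print(all(
    conditional_pmf_x1(x1, 4, 6, 5, p) == hypergeometric_pmf(x1, 4, 6, 5)
    for x1 in range(0, 5)
))
```

The common factor $p^{x_\cdot}(1 - p)^{n_\cdot - x_\cdot}$ cancels between numerator and denominator, which is why the result does not depend on $p$.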

For further results of a similar kind see Bolger and Harkness (1965), Bolshev
(1965), Csörgő and Seshadri (1970), Menon (1966), Prohorov (1966), and Rasch
(1974).


(iii) It is possible for the components of a partition (co^^\...,co^'”^) of co to be 
L-independent without this independence being induced by cut(s). 

Example 4.24. For a birth and death process, observed during the time interval 
[0, T], the likelihood function is 


and $\lambda$ and $\mu$ are L-independent (see Example 3.8). If this L-independence
corresponded to a cut then the cut would be S-sufficient for either $\lambda$ or $\mu$, say $\lambda$.
For fixed $\mu$ the pair $(b, z)$ is minimal sufficient (cf., for instance, Theorem 4.3), and
$(b, z)$ would therefore have to be such a cut, which is impossible since the
distribution of $(b, z)$ depends on $\mu$, as is apparent for instance from the formula
(Puri 1968)


Ez = ^— U 

k-ti 


(iv) Ancillarity-sufficiency; combination. The present subsection contains a brief
and somewhat informal discussion of certain possibilities for extending and
combining some of the more basic aspects of the considerations in Sections
4.1-4.4.

Let t and u be statistics, assume that t is a function of u, and let

(2) $p(x; \omega) = p(t; \omega)\,p(u; \omega|t)\,p(x; \omega|u)$

be the factorization of the probability function for x into the marginal density of t,
the conditional density of u given t and the conditional density of x given u.
Corresponding to this factorization the experiment yielding the observation x
may be viewed as being composed of three successive experiments. The first
experiment leads to observation of t. In the next experiment u is observed, the
experiment being such that the distribution of u has probability function $p(\cdot\,; \omega|t)$.
Finally, the third experiment yields x with distribution $p(\cdot\,; \omega|u)$.

Suppose that $p(u; \omega|t)$ depends on the interest parameter $\psi$ only so that (2) has
the form

$p(x; \omega) = p(t; \omega)\,p(u; \psi|t)\,p(x; \omega|u)$.

One may ask then whether the information that the quantity t, which determines
the second experiment, was arrived at by a random experiment with distribution
$p(t; \omega)$ and that the second experiment was followed by a third with distribution
$p(x; \omega|u)$, is in its totality irrelevant as regards inference on $\psi$.

The question will not be treated in any detail here. Suffice it to mention the 
rather obvious idea of contemplating combinations of B-, S-, G-, and 
M-ancillarity with B-, S-, G-, and M-sufficiency, thus introducing, for instance, 
the concept of M-B-nonformation defined as follows. The statistic (t, x) is said to
be M-B-nonformative with respect to $\psi$ provided: (i) for each $\psi$ the corresponding
family of probability functions for t is universal; (ii) the conditional distribution of
x given u does not depend on $\omega$. If (t, x) is M-B-nonformative then it is arguable
that w) contains all the information on \jj given by {‘pjX). 

The above remarks are related to the paper by Cox (1975) which discusses a 
generalization of the ideas of conditional and marginal likelihood. 

(v) On Birnbaum's Theorem. Birnbaum's Theorem states that the sufficiency axiom
(S) and the conditionality axiom (C) together imply the likelihood axiom (L) (see
Birnbaum 1962, 1969). Here, (S) specifies that if t is a B-sufficient statistic then the
statistical evidence on $\omega$ contained in $(\mathfrak{P}, x)$ is equivalent to that obtained in

(C) is the analogous specification for a B-ancillary statistic, while (L) says
that the evidence is entirely conveyed by the likelihood function corresponding to 
the observed x. 

This result has caused much discussion because many statisticians have 
considered (C) and (S) as necessary, or at least acceptable, building blocks of a 
satisfactory theory of statistical inference, whereas they have found (L) unacceptable
for various reasons, the most prominent being the contradictions existing
between the approach to inference which flows from (L) and the classical way of 
performing significance tests. The prototype of such contradictions is Armitage’s 
(1961) example, which builds on the fact that the likelihood function is 
independent of the stopping rule. 

The attempts by Durbin (1970) (see also Birnbaum 1970, and Savage 1970) and 
Kalbfleisch (1975) to eschew the problem by introducing modified versions of (S) 
and (C) and rules for the order in which these are to be applied, although 
interesting, have not yielded convincing and comprehensive solutions. 

As pointed out in Barndorff-Nielsen (1975), Birnbaum’s result may be 
paraphrased as saying that if it is set up as a requirement that application of the
ideas of (B-) sufficiency and (B-) ancillarity never leads to conflicting, or
non-equivalent, conclusions then these conclusions have to obey the likelihood
principle. However, as mentioned in Section 1.1, such a uniqueness requirement 
appears unwarrantable. 

(vi) Some non-uniqueness examples. A number of examples will be mentioned
which show that different applications of various of the ancillarity, sufficiency,
and nonformation concepts to one and the same statistical situation $(\mathfrak{P}, x)$ may
and suppose that within each of the texts the words are fairly homogeneous. For
such situations Rasch (1960) proposed the model which specifies that the t,j are 
independent with t,j being distributed as a sum of Uj independent exponential 
variates having parameter of the form SiSj. 

Set x,j = In tij, (Xi = In pj = In Sj and 

= Z = Z = Z 

j I 

a, = Z^i«.>/5. =Z'^A- 

with the and Wj being arbitrarily selected, positive weights which satisfy 

Z«,-=l, Z'^7=l- 

It was observed by Rasch that the distribution of x . depends on a. 4- only, 

while the distributions of the vectors (xi — x., ,X;, —xj and 

(x.i — X.., . . , ,x m -- x.j depend on (a^ — — a ) and 

ipi — — PX respectively; further, the distribution of the matrix 

[Xij — Xj. — X j + X .] is independent of all of the parameters. 

The model belongs to the general class of additive models

(3) $x_{ij} = \alpha_i + \beta_j + u_{ij}$

for which the matrix $[u_{ij}]$ of error variables has a known distribution. (In the
particular case of the reading speed model, the $u_{ij}$ are independent and $\exp\{u_{ij}\}$
follows a gamma distribution with shape parameter Uj and scale parameter 1.) 

It is immediately obvious from (3) that Rasch's observation holds for any of the
models in this class. (It may also be noted that for each of these models the family
of distributions of $x_{**}$ is a group family.)
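
The parameter-freeness of the doubly centred matrix can be made concrete with a small computation. The Python sketch below is illustrative only: it uses unweighted row, column and grand means, which is the special case of equal weights and an assumption of this sketch, and all names and numerical values are hypothetical. For a fixed error matrix $[u_{ij}]$, two different choices of $(\alpha, \beta)$ in the additive model (3) give the same matrix $[x_{ij} - x_{i\cdot} - x_{\cdot j} + x_{\cdot\cdot}]$.

```python
import random


def doubly_centred(x):
    """Matrix [x_ij - x_i. - x_.j + x_..] using unweighted means (an assumption)."""
    rows, cols = len(x), len(x[0])
    row_mean = [sum(r) / cols for r in x]
    col_mean = [sum(x[i][j] for i in range(rows)) / rows for j in range(cols)]
    grand = sum(row_mean) / rows
    return [
        [x[i][j] - row_mean[i] - col_mean[j] + grand for j in range(cols)]
        for i in range(rows)
    ]


def additive_table(alpha, beta, u):
    """x_ij = alpha_i + beta_j + u_ij, as in the additive model (3)."""
    return [
        [alpha[i] + beta[j] + u[i][j] for j in range(len(beta))]
        for i in range(len(alpha))
    ]


rng = random.Random(1)
u = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]  # any fixed error matrix

r1 = doubly_centred(additive_table([0.0, 0.0, 0.0], [0.0] * 4, u))
r2 = doubly_centred(additive_table([1.5, -2.0, 0.3], [4.0, 0.1, -1.0, 2.2], u))
max_gap = max(abs(a - b) for ra, rb in zip(r1, r2) for a, b in zip(ra, rb))
print(max_gap)  # zero up to rounding error
```

The additive terms $\alpha_i$ and $\beta_j$ cancel in the double centring, so the result is a function of the errors alone; the same cancellation holds for any weights summing to one.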

Moreover, for any such model, the statistic (Xj — x ,...,Xi -- x.) is 
G-sufficient with respect to (ai — a,..., a/ — a). Similarly for x.. together with 
a 4- A > and for (x.i - x , x ^ - x ,) together with (j8i - , . . . , - PX 

Since $[x_{ij} - x_{i\cdot} - x_{\cdot j} + x_{\cdot\cdot}]$ is B-ancillary, inference concerning the parameters
should according to the ancillarity principle be drawn in the conditional model
given $[x_{ij} - x_{i\cdot} - x_{\cdot j} + x_{\cdot\cdot}]$. This conditional model does (clearly) also belong to
the general class considered here. Thus it is arguable that a proper distribution for
inference on (ai — a,...,a^ — a ), say, is that of (xi. — x , .x^ — x.) given 
[Xij — Xj- — X j + X J. But this distribution is different from the marginal distri- 
bution of (Xi. — x..,...,Xi — X.) unless (xi. — x ,...,Xi — x.) and 
[x^j — Xj — X J 4“ X .] are independent; if the Uij are independent then the two 
distributions are equal only if the $u_{ij}$ are normally distributed, as follows from the
Skitovič-Darmois theorem (see e.g. Linnik 1964) which states that independence
of two linear combinations, with nonzero coefficients, of a set of
independent random variates implies normality of these latter variates.

As a basis for inference on (ai — a., . . . , a/ — a ) the conditional distribution of
(xi. “ X - xj given [x^; - x,-. - Xj 4- xj may be found preferable to

the marginal, on the ground that the control of the model based on the ancillary 
statistic [x,j - x^. - Xj + xj appears as completely untangled as possible from 
inference on (xi - a , . . . , - a ) when the conditional distribution is employed. 

► 


4.8 NOTES 

The ideas of sufficiency and ancillarity were introduced, respectively, in Fisher 
(1920) and Fisher (1934, 1935). A precursor of the ancillarity idea had however 
been mentioned in the last section of Fisher (1925); this earlier idea is commented 
on in Efron (1975). It may also be added that the conditional, exact test for the 
2x2 contingency table was proposed in the fifth edition of Fisher’s Statistical 
Methods for Research Workers (1934). Fisher’s discussion of sufficiency in the 
papers Fisher (1920, 1921, 1925) has often been taken as being concerned with 
what has here been called B-sufficiency, but it seems that he may have had a more 
general concept in mind, rather like: a statistic t is said to be sufficient with respect
to a parameter of interest $\psi$ if for any statistic u such that the joint distribution of t
and u depends on $\psi$ only it holds that the conditional distribution of u given t is
independent of $\psi$.

In the mid-thirties, separate inference was also advocated by Bartlett (1936, 
1937). The first of these papers concerns conditional estimation, the second 
conditional testing. In the latter paper Bartlett presented: (i) the derivation of the 
conditional likelihood ratio test for the identity of the variances in k independent, 
normal samples; (ii) the viewpoint that control of whether a sample is normally
distributed ought to be carried out in the parameter-free conditional distribution
given $(\bar{x}, s^2)$, and similarly for the exponential distribution; (iii) the
viewpoint that a test for the identity of k Poisson distributions, when one
observation is available from each distribution, in principle ought to be carried
out in the conditional distribution given the sum of the observations, and
similarly for the binomial distribution; (iv) a discussion of the conditional
likelihood ratio test for independence in the general two-dimensional
contingency table.

The cornerstone of the Neyman-Pearson test theory is the paper by Neyman 
and Pearson (1933) in which test power is proposed as the central criterion for 
evaluation of the quality of tests. In that work the concept of a similar test is also 
introduced and it is shown that the requirement of similarity implies, in a wide 
class of cases, that the test is composed of conditional tests with the same level as 
the test itself. However, although this result is much used for the construction of 
tests in the Neyman-Pearson approach (see, for instance, Lehmann 1959), it has 
never been part of that approach to perform the tests separately, in the 
conditional model. (Conditional inference given a B-ancillary statistic has been
criticized by Welch (1939) on the ground that it may yield confidence intervals
which if viewed unconditionally are inefficient. Apart from the fact that this kind
of comparison is begging the question, Welch did not, in his argument, allow the
conditional procedure its full flexibility and accordingly his conclusion does not
hold, cf. Barnard (1976).)

A highly clarifying discussion of the meaning of ancillarity, in which the idea of 
mixture experiments was brought to the fore, was given by Cox (1958). 

The work by G. Rasch on what he has called measurement models and specific 
objectivity should also be mentioned as a very considerable impetus in the field of 
inferential separation (see Rasch 1960, 1961, 1968). 

The abstract theory of B-sufficiency built up primarily by Halmos and Savage 
(1949) and Bahadur (1954) is more general than is needed for the main part of 
statistical inference purposes. Moreover, it has turned out that, on this general 
level, the proposed definitions of sufficiency and minimal sufficiency do not 
possess certain basic properties which are, essentially, always met in applications. 
For instance, a minimal sufficient $\sigma$-algebra and a minimal sufficient statistic, in
the sense of Bahadur (1954), may both (Pitcher 1957) or each one separately
(Landers and Rogge 1972b) fail to exist. (A minimal sufficient $\sigma$-algebra always
exists provided the family of probability measures is dominated, but even under
this assumption it can happen that a minimal sufficient statistic, in the Bahadur
sense, is not available, as demonstrated in the latter paper.) Moreover, a common
regular conditional probability measure given a sufficient $\sigma$-algebra need not
exist, even if the original $\sigma$-algebra is separable and each member of the family of
probability measures admits a regular conditional probability measure given the
sufficient $\sigma$-algebra, cf. Landers and Rogge (1972a). These problems, which from
the statistical viewpoint are fictitious, do not occur within the framework for
B-sufficiency discussed in Section 4.2, due to the following facts: (i) the sample
space is Euclidean (it would have been sufficient to assume that the basic $\sigma$-algebra
was separable); (ii) only sufficiency of statistics, and not sub-$\sigma$-algebras, is
considered, and statistics are defined as taking values in a Euclidean space; (iii) $\mathfrak{P}$ is
dominated by a $\sigma$-finite measure; (iv) a B-sufficient statistic is defined as one for
which there exists a common regular conditional probability measure given the
statistic (rather than in terms of conditional expectations, as in the
Halmos-Savage-Bahadur approach). In essence, Theorem 4.1 is due to Halmos
and Savage (1949) while Theorem 4.2 and Corollary 4.1 were given by Bahadur
(1954). Theorem 4.3 and Corollary 4.3 have been presented in Barndorff-Nielsen,
Hoffmann-Jørgensen, and Pedersen (1976).

Basu (1959) has discussed the classes of ancillary statistics and events from an 
abstract viewpoint. In concrete cases it is often an intricate mathematical 
problem to determine these families. An outstanding classical example is that of 
finding similar tests for the Behrens-Fisher problem, see Linnik (1968). (The
problem of finding the class of similar (nonrandomized) tests of a given
hypothesis is the same as the problem of finding the class of B-ancillary events for
a certain sub-family of $\mathfrak{P}$.)

The term nonformation and the definitions of (pointwise) B-, S-, G-, and
M-nonformation were introduced in Barndorff-Nielsen (1976c).

The notions of S-sufficiency and S-ancillarity are due respectively to Fraser
(1956) and to Sverdrup (1966) and Sandved (1967). Some properties of
S-ancillarity were discussed in Sandved (1972). The definition of G-sufficiency is
virtually equivalent to Barnard's (1963a) definition of a concept of sufficiency (cf.
Section 4.4), and M-ancillarity and M-sufficiency were proposed in
Barndorff-Nielsen (1973a, c). A critique of certain other proposals for definition of ancillarity
or sufficiency may be found in Barndorff-Nielsen (1973a).

The material on quasi-ancillarity and quasi-sufficiency, presented in Section
4.6, is new, while the core of the material in Section 4.5 was presented in
Barndorff-Nielsen (1976b) (see also Jensen 1976).

The population to which a proposition is referred and by which its probability 
is determined was termed the reference set by R. A. Fisher. Conditional inference 
involves a change of reference set. An overview of Fisher’s ideas on the role of the 
reference set in statistical, particularly fiducial, inference is available in Pedersen 
(1976). 



PART II


Convex Analysis, Unimodality, and 
Laplace Transforms 


A concise account is given of those parts of the subjects of convexity, unimodality, 
and Laplace transforms which will be invoked, for statistical purposes, in Part III. 
In particular, conjugate convex functions and certain convex duality properties 
are discussed. 




CHAPTER 5 

Convex Analysis 


5.1 CONVEX SETS 

The convex hull of a set $M$ in $R^k$ will be denoted by conv $M$. The relative interior
of a convex set $C$ is denoted by ri $C$, the relative boundary of $C$ by rbd $C$, and
dim $C$ will stand for the dimension of $C$ (i.e. the dimension of the affine hull of $C$).

Theorem 5.1. For any convex set $C$ in $R^k$, cl(ri $C$) = cl $C$ and ri(cl $C$) = ri $C$.

For a proof see Rockafellar (1970), p. 46. ►

Let $M_1$ and $M_2$ be arbitrary subsets of $R^k$. A hyperplane $H$ is said to separate
$M_1$ and $M_2$ if $M_1$ is contained in one of the two closed halfspaces determined by
$H$ and $M_2$ is contained in the other closed halfspace. $H$ separates $M_1$ and $M_2$
properly if $M_1$ and $M_2$ are not both contained in $H$. It is said to separate $M_1$ and
$M_2$ strongly provided it separates $M_1$ and $M_2$ and provided $M_1$ and $M_2$ are both
at a positive distance from $H$.

Theorem 5.2. Let $C_1$ and $C_2$ be convex sets in $R^k$. In order that there exists a
hyperplane separating $C_1$ and $C_2$ properly it is necessary and sufficient that ri $C_1$
and ri $C_2$ have no point in common.

For a proof see Rockafellar (1970), p. 97. ►

Theorem 5.3. Let $M$ be a subset of $R^k$. If $x \in$ conv $M$ then $x$ can be written as a
convex combination of $k + 1$ points of $M$. Moreover, if $x \in$ int(conv $M$) then there
exists a natural number $m \le 2k$ and points $x_1, x_2, \ldots, x_m$ in $M$ such that
$x \in$ int conv $\{x_1, \ldots, x_m\}$.

The first assertion is, of course, Carathéodory's theorem. For a proof of the
second assertion see Valentine (1964), p. 41. ►

Let $x$ be a point of a convex set $C$. A vector $x^*$ is said to be normal to $C$ at $x$
provided

$(z - x) \cdot x^* \le 0$ for all $z \in C$.

The set of vectors which are normal to $C$ at $x$ form a convex cone $K(x)$ called the
normal cone to $C$ at $x$. Clearly, $K(x) = \{0\}$ for $x \in$ int $C$ while $K(x)$ contains a
halfline for every $x \in C \setminus$ int $C$. Defining the normal cone to $C$ at a point $x \notin C$ by
$K(x) = \emptyset$ one has then established a mapping $K$ on all of $R^k$, the normal cone
mapping.

Any halfline or nonzero vector in $R^k$ determines a direction of $R^k$. Formally, a
direction of $R^k$ is defined as an equivalence class of closed halflines in $R^k$, two
closed halflines being equivalent if they are translates of each other. A convex
subset $C$ of $R^k$ is said to recede in the direction $D$ if $C$ includes all the halflines in the
direction $D$ which start at points of $C$. The set consisting of the zero vector and of
all the vectors which determine the directions of recession of $C$ is a convex cone; it
is called the recession cone of $C$ and is denoted by $0^+C$ (cf. Rockafellar 1970). If $C$
is closed or open and if it contains some closed halfline in the direction $D$ then it
actually recedes in the direction $D$. Moreover, for any convex set $C$ one has

(1) $0^+(\text{ri}\, C) = 0^+(\text{cl}\, C) \supset 0^+C$.

The recession cone of a closed convex set is closed and if $K$ is a closed convex cone
then

(2) $0^+K = K$.

Let $C$ be a convex set in $R^k$ and let $x^*$ be a nonzero vector in $R^k$. $C$ is said to be
bounded in the direction determined by $x^*$ provided there exists an $a \in R$ such that
$x \cdot x^* \le a$ for every $x \in C$. The barrier cone of $C$, denoted by bar $C$, is the set
consisting of the zero vector and of the vectors in the direction of which $C$ is
bounded. Any such cone is convex. For any convex set $C$

(3) bar $C$ = bar(ri $C$) = bar(cl $C$).

The polar of a convex cone $K$ is the convex cone defined by
$K^\circ = \{x^* : x \cdot x^* \le 0 \text{ for all } x \in K\}$. The polar is closed and

(4) $K^{\circ\circ} = \text{cl}\, K$.

Theorem 5.4. The polar of the barrier cone of a closed convex set $C$ is the recession
cone of $C$; in symbols

(5) $(\text{bar}\, C)^\circ = 0^+C$.

For a proof see Rockafellar (1970), p. 123. ►

Theorem 5.5. A closed convex set $C$ in $R^k$ is bounded if and only if its recession cone
consists of the zero vector alone.

For a proof see Rockafellar (1970), p. 64. ►

If $C$ is a convex set, the set $(-0^+C) \cap 0^+C$ is called the lineality space of $C$ and is
denoted by lin $C$. The set lin $C$ is the largest subspace contained in $0^+C$. It
follows from (1) that

(6) lin(cl $C$) = lin(ri $C$) $\supset$ lin $C$.

One has (cf. Rockafellar 1970, p. 65)

(7) $C = \text{lin}\, C + (C \cap (\text{lin}\, C)^\perp)$.

Theorem 5.6. Suppose $K$ is a closed convex cone. Then

(8) $(\text{lin}\, K)^\perp = \text{aff}\, K^\circ$.

This result is a special case of Theorem 14.6 in Rockafellar (1970). ►

Theorem 5.7. Let $C$ be a non-empty convex set. Then

$(\text{lin}(\text{cl}\, C))^\perp = (\text{lin}(\text{ri}\, C))^\perp = \text{aff}(\text{bar}(\text{cl}\, C)) = \text{aff}(\text{bar}(\text{ri}\, C))$.

Proof. By (6) and (3) it suffices to show that if $C$ is closed then

$(\text{lin}\, C)^\perp = \text{aff}(\text{bar}\, C)$.

We have

$\text{aff}(\text{bar}\, C) = \text{aff}(\text{cl}(\text{bar}\, C))$

and (cf. (4) and Theorem 5.4)

$(0^+C)^\circ = \text{cl}(\text{bar}\, C)$.

Hence, applying Theorem 5.6, we find

$\text{aff}(\text{bar}\, C) = \text{aff}\,(0^+C)^\circ = (\text{lin}\, 0^+C)^\perp$.

It thus remains to prove that

(9) $\text{lin}\, C = \text{lin}\, 0^+C$.

Since $C$ is closed, $0^+C$ is a closed convex cone and hence $0^+(0^+C) = 0^+C$, cf. (2).
The equality (9) now follows from the definition of lineality space. ►

A face of a convex set C is a convex subset F of C such that any (closed) line 
segment in C with a relative interior point in F has both endpoints in F. 
Obviously, $\emptyset$ and $C$ are both faces of $C$; any other face is called proper.

A point of C is an extreme point if it cannot be expressed as a convex 
combination of two other points in C which are both different from the former. 
The extreme points of C coincide with those faces of C which consist of precisely 
one point. 

The collection of the relative interiors of the (non-empty) faces of a convex set C 
constitutes a partition of C; in other words, every point in C belongs to precisely 
one of those relative interiors (cf. Rockafellar 1970, p. 164).

Theorem 5.8. Let $C$ = conv $M$, where $M$ is a subset of $R^k$, and let $F$ be a non-empty
face of $C$.

Then $F$ = conv($M \cap F$).

For a proof see Rockafellar (1970), p. 165. ►

A convex set $C$ is a polytope if it is the convex hull of finitely many points in $R^k$.
Any proper face $F$ of a polytope $C$ is of the form $F = H \cap C$ where $H$ is a
supporting hyperplane of $C$.

The affine mappings considered in the remainder of the present section are all
on $R^k$ into $R^k$.

Lemma 5.1. If $a$ is an affine mapping and $C$ a convex set in $R^k$ then $a(C)$ is convex and
ri $a(C)$ = $a$(ri $C$), cl $a(C) \supset a$(cl $C$).

For a proof see Rockafellar (1970), p. 48. ►

The next three lemmas are simple to prove.

Lemma 5.2. Any affine mapping $a$ is of the form $a = p \circ a_0 + b$ where $a_0$ is a
regular linear mapping, $p$ is a projection and $b$ a translation.

Lemma 5.3. If $a$ is a regular affine mapping and $M$ an arbitrary set in $R^k$ then
$a$(conv $M$) = conv $a(M)$, $a$(cl conv $M$) = cl conv $a(M)$ and $a$(ri conv $M$) =
ri conv $a(M)$.

Lemma 5.4. If $p$ is a projection and $M$ an arbitrary set in $R^k$ then
$p$(conv $M$) = conv $p(M)$, cl $p$(cl conv $M$) = cl conv $p(M)$ and $p$(ri conv $M$) =
ri conv $p(M)$.

On combining Lemmas 5.1-5.4 one obtains

Theorem 5.9. If $a$ is an affine mapping and $M$ an arbitrary subset of $R^k$ then
$a$(conv $M$) = conv $a(M)$, cl $a$(cl conv $M$) = cl conv $a(M)$ and $a$(ri conv $M$) =
ri conv $a(M)$.
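
The first identity of Theorem 5.9 rests on the fact that an affine mapping carries a convex combination of points to the same convex combination of their images. The Python sketch below illustrates this pointwise for a finite set $M$ in the plane; the set, the mapping (deliberately singular, hence not regular) and the weights are hypothetical choices made for the illustration.

```python
def affine_map(point, matrix, shift):
    """Apply x -> matrix x + shift to a point given as a list of coordinates."""
    return [
        sum(a * c for a, c in zip(row, point)) + b
        for row, b in zip(matrix, shift)
    ]


def convex_combination(points, weights):
    """Weighted average of points; weights are nonnegative and sum to one."""
    dim = len(points[0])
    return [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]


M = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # hypothetical finite set M
A = [[1.0, 2.0], [0.0, 0.0]]                           # singular, so not a regular mapping
b = [0.5, -1.0]
weights = [0.1, 0.2, 0.3, 0.4]

image_of_combination = affine_map(convex_combination(M, weights), A, b)
combination_of_images = convex_combination([affine_map(p, A, b) for p in M], weights)
print(image_of_combination, combination_of_images)
```

The agreement of the two results depends on the weights summing to one, which is what allows the translation part $b$ to pass through the combination unchanged.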

5.2 CONVEX FUNCTIONS 

Let $f$ be a function defined on a subset $D$ of $R^k$ and with values in $[-\infty, \infty]$. The
set

$\{(x, \eta) : x \in D,\ \eta \in R,\ f(x) \le \eta\}$

is called the epigraph of $f$ and is denoted by epi $f$. The function $f$ is convex if $f$ does
not take the value $-\infty$ and if epi $f$ is non-empty and convex as a subset of $R^{k+1}$.
(Rockafellar (1970) uses a slightly wider definition of convex function in that he
allows $-\infty$ as function value and also he does not require epi $f$ to be non-empty.
The definition used here coincides with Rockafellar's definition of a proper
convex function.) This definition is equivalent to the requirement that $D$ is convex
and that $f$ is finite for at least one $x \in D$ and satisfies

$f((1 - \lambda)x + \lambda y) \le (1 - \lambda)f(x) + \lambda f(y)$

whenever $x \in D$, $y \in D$ and $0 < \lambda < 1$. As testified by Rockafellar's writings there
are great technical advantages in allowing $+\infty$ as a value in the definition of
convex function. Note that a convex function $f$ can always be extended to a
convex function on all of $R^k$ by setting $f(x) = +\infty$ for $x \notin D$.

The effective domain of a convex function $f$ is the set $\{x : f(x) < \infty\}$, which is
convex. This set will be denoted by dom $f$. Thus, if $f$ is a convex function on $R^k$
then domain $f = R^k$ while dom $f$ is, in general, a genuine subset of $R^k$.
It is simple to prove the following two theorems:

Theorem 5.10. If $f$ is a convex function on $R^k$ and $A$ is a $k \times m$ matrix then the 
function defined on $R^m$ by 

$$y \mapsto \inf\{f(x) : xA = y\}$$

is convex, provided it nowhere takes the value $-\infty$. (Even if $-\infty$ is a value, this 
function is still convex in the extended sense that its epigraph is convex.) 

Theorem 5.11. Let $f$ be a convex function on $R^k$ and let $M$ be a subset of $R^k$. Then 

$$\sup\{f(x) : x \in \operatorname{conv} M\} = \sup\{f(x) : x \in M\}$$

and the first supremum is attained only when the second (more restricted) supremum 
is attained. 

A function $g$ on $R^k$ into $[-\infty, \infty)$ is concave provided $-g$ is convex. 

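Example 5.1. Let $g$ be a concave, everywhere finite function on $R^k$, and let $h$ be a 
non-increasing convex function on $R$. 

Then $f = h \circ g$ is convex because 

$$f((1 - \lambda)x + \lambda y) \le h((1 - \lambda)g(x) + \lambda g(y)) \le (1 - \lambda)f(x) + \lambda f(y)$$

for every $x, y \in \operatorname{dom} f$ and $\lambda \in [0, 1]$. The first inequality uses the concavity of $g$ 
together with the monotonicity of $h$, the second the convexity of $h$. ► 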

A function $f$ on $R^k$ into $(-\infty, \infty]$ is said to be quasiconvex if for every $\alpha \in R$ the 
level set $\{x : f(x) \le \alpha\}$ is convex. (It is obvious that any convex function is 
quasiconvex.) 

Theorem 5.12. A function $f$ on $R^k$ into $(-\infty, \infty]$ is convex if and only if for every 
$x^* \in R^k$ the function 

$$f(x) - x^* \cdot x, \quad x \in R^k$$

is quasiconvex. 

Proof. The only if assertion is trivial. 

Suppose $f$ is not convex. Then there exist $x_0, x_1 \in \operatorname{dom} f$ and $\lambda \in (0, 1)$ such that, 
letting $\alpha_0 = f(x_0)$ and $\alpha_1 = f(x_1)$, 

$$f((1 - \lambda)x_0 + \lambda x_1) > (1 - \lambda)\alpha_0 + \lambda\alpha_1.$$

Determine $x^* \in R^k$ and $c_0 \in R$ so that 

$$x^* \cdot (x_1 - x_0) = \alpha_1 - \alpha_0$$
$$x^* \cdot x_0 + c_0 = \alpha_0.$$

Then the graph of the affine function 

$$x^* \cdot x + c_0, \quad x \in R^k$$

contains the points $(x_0, \alpha_0)$ and $(x_1, \alpha_1)$, and hence $x_0$ and $x_1$ belong to the level set 

$$\{x : f(x) - x^* \cdot x \le c_0\}.$$

But $(1 - \lambda)x_0 + \lambda x_1$ is not an element of this set and thus 

$$f(x) - x \cdot x^*, \quad x \in R^k$$

is not quasiconvex. ► 
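
The construction in this proof can be traced numerically. The sketch below is only an illustration: the function $f(x) = \sqrt{|x|}$ on $R$ (quasiconvex but not convex), the points 0 and 1, and the weight 0.25 are choices made here, not taken from the text, and the names `f`, `tilted`, `x_star`, `c0` are hypothetical.

```python
import math

def f(x):
    """A quasiconvex function on R that is not convex (illustrative choice)."""
    return math.sqrt(abs(x))

# Two points and a weight at which the convexity inequality fails.
x0, x1, lam = 0.0, 1.0, 0.25
a0, a1 = f(x0), f(x1)
z = (1 - lam) * x0 + lam * x1
violates_convexity = f(z) > (1 - lam) * a0 + lam * a1

# Slope x_star and intercept c0 of the affine function whose graph
# passes through (x0, a0) and (x1, a1), as in the proof.
x_star = (a1 - a0) / (x1 - x0)
c0 = a0 - x_star * x0

def tilted(x):
    """The function f(x) - x_star * x whose level set is examined."""
    return f(x) - x_star * x

# x0 and x1 lie in the level set {tilted <= c0}; the point z between
# them does not, so this level set is not convex.
in_level_set = [tilted(x) <= c0 for x in (x0, z, x1)]
print(violates_convexity, in_level_set)
```

With these choices the end points belong to the level set while the intermediate point does not, which is exactly the failure of quasiconvexity that the proof exhibits.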

Recall that a function $f$ on a set $D\ (\subset R^k)$ into $(-\infty, \infty]$ is, by definition, lower 
semi-continuous at a point $x \in D$ if 

$$f(x) = \liminf_{y \to x} f(y).$$

If $f$ is any convex function on $R^k$ then the closure (in $R^{k+1}$) of its epigraph is the 
epigraph of a certain function which is called the closure of $f$ and is denoted by $\operatorname{cl} f$. 
$\operatorname{cl} f$ is convex and lower semi-continuous, and $\operatorname{cl} f = f$ if and only if $f$ is lower 
semi-continuous. $f$ is said to be closed if $f = \operatorname{cl} f$. In any case $\operatorname{cl} f$ agrees with $f$ 
except perhaps at relative boundary points of $\operatorname{dom} f$. Moreover, we have: 

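Theorem 5.13. Let $f$ be a convex function on $R^k$. Then for every $x \in \operatorname{ri}(\operatorname{dom} f)$ and 
every $y \in R^k$ 

$$(\operatorname{cl} f)(y) = \lim_{\lambda \uparrow 1} f((1 - \lambda)x + \lambda y).$$

Corollary 5.1. For a closed convex function $f$ on $R^k$ one has 

$$f(y) = \lim_{\lambda \uparrow 1} f((1 - \lambda)x + \lambda y)$$

for every $x \in \operatorname{dom} f$ and every $y$. 

For proofs of Theorem 5.13 and its corollary see Rockafellar (1970), p. 57. ► 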

Theorem 5.14. Let $f$ be a closed convex function on $R^k$. 

The non-empty sets among the level sets $\{x : f(x) \le \alpha\}$, $\alpha \in R$, are closed and 
convex and they all have the same recession cone. 

For a proof see Rockafellar (1970), p. 58 and p. 70. ► 

A convex function $f$ on $R^k$ is polyhedral if its epigraph is the intersection of a 
finite collection of closed halfspaces of $R^{k+1}$. Such a function is closed, and it 
attains its infimum provided it is bounded below (cf. Rockafellar 1970, p. 268). 

The recession function of a convex function $f$ is the convex function whose 
epigraph is $0^+(\operatorname{epi} f)$. It is denoted by $f0^+$. 

Theorem 5.15. Let $f$ be a convex function, and let $y$ be a vector. If one has 

$$\liminf_{\lambda \to \infty} f(x + \lambda y) < \infty$$

for a given $x$, then $x$ actually has the property that $f(x + \lambda y)$ is a non-increasing 
function of $\lambda$, $-\infty < \lambda < \infty$. This property holds for every $x$ if and only if 
$(f0^+)(y) \le 0$. When $f$ is closed, the property holds for every $x$ if it holds for even one 
$x \in \operatorname{dom} f$. 

For a proof see Rockafellar (1970), p. 68. ► 

Example 5.2. For any open convex subset $C$ of $R^k$ there exist closed convex functions $f$ 
such that $\operatorname{dom} f = C$ and 

$$\liminf_{\lambda \to \infty} f(x + \lambda y) = \infty \tag{1}$$

for every $x$ and every $y \ne 0$. This will be shown here by exhibiting a concrete 
example of such a function. 

Nothing is lost by assuming $0 \in C$. Moreover, if $C = R^k$ then the function $x \cdot x$ is 
of the desired kind, so suppose $C \ne R^k$. 

Let $d$ be the function on $R^k$ such that $d(x)$ is the distance of $x$ to the boundary of 
$C$ provided $x \in C$, while $d(x)$ is zero otherwise. Then $d$ is continuous, and concave on $C$. 
The latter assertion may be proved as follows. Let $x_0, x_1 \in C$, let $S_i$ be the sphere 
with centre $x_i$ and radius $d(x_i)$, $i = 0, 1$, and set $M = \operatorname{bd} \operatorname{conv}(S_0 \cup S_1)$. For any 
$\lambda \in (0, 1)$ one finds 

$$d((1 - \lambda)x_0 + \lambda x_1) \ge \inf_{z \in M} \|(1 - \lambda)x_0 + \lambda x_1 - z\| = (1 - \lambda)d(x_0) + \lambda d(x_1).$$

Now define $f$ by 

$$f(x) = x \cdot x + 1/d(x), \quad x \in R^k,$$

with $1/d(x)$ read as $+\infty$ when $d(x) = 0$. 
This function is closed convex and it clearly satisfies (1) for $x = 0$ and hence, by 
Theorem 5.15, for all $x$. ► 

Let $f$ be an arbitrary function on $R^k$ into $(-\infty, \infty]$ (which is not identically 
$+\infty$). The convex hull of $f$ is the function $\operatorname{conv} f$ on $R^k$ defined by 

$$(\operatorname{conv} f)(x) = \inf\{\eta : (x, \eta) \in \operatorname{conv} \operatorname{epi} f\}. \tag{2}$$

This function is convex, provided it does not take the value $-\infty$, and it is the 
greatest convex function $\le f$. Set $S = \operatorname{dom} f$ and let $S'$ consist of the points 
$(x, f(x))$, $x \in S$, and the direction of $(0, 1)$ (where $0 \in R^k$, $1 \in R$). It is simple to see that 

$$\operatorname{conv} \operatorname{epi} f = \operatorname{conv} S' \tag{3}$$

where $\operatorname{conv} S'$ is defined by $\operatorname{conv} S' = \operatorname{conv}\{(x, f(x)) : x \in S\} + \{\lambda(0, 1) : 0 \le \lambda < \infty\}$. 

Theorem 5.16. Let $f$ be an arbitrary function on $R^k$ into $(-\infty, \infty]$. 

One has 

$$(\operatorname{conv} f)(x) = \inf\Big\{\sum_{i=1}^{k+1} \lambda_i f(x_i) : \sum_{i=1}^{k+1} \lambda_i x_i = x\Big\} \tag{4}$$

where the infimum is taken over all expressions of $x$ as a convex combination of 
$k + 1$ points in $R^k$. 

For a proof see Rockafellar (1970), p. 157. ► 

Theorem 5.17. Let $f$ be a function on $R^k$ into $(-\infty, \infty]$, set $S = \operatorname{dom} f$ and let $S'$ 
consist of the points $(x, f(x))$, $x \in S$, and the direction $(0, 1)$. Suppose $S$ is a finite set. 
Then $\operatorname{conv} f$ is a closed convex function and $\operatorname{epi}(\operatorname{conv} f) = \operatorname{conv} S'$. 

Proof. $\operatorname{conv} S'$ is a closed set and 

$$(\operatorname{conv} f)(x) = \inf\{\eta : (x, \eta) \in \operatorname{conv} S'\},$$

cf. (2) and (3). The theorem now follows at once. ► 

Let $\{f_\omega : \omega \in \Omega\}$ be an arbitrary collection of closed convex functions on $R^k$ and 
set 

$$f = \sup\{f_\omega : \omega \in \Omega\},$$

the supremum being taken pointwise. Then $f$ is also a closed convex function, 
unless it is identically $+\infty$. 

5.3 CONJUGATE CONVEX FUNCTIONS 

The conjugate of a convex function $f$ on $R^k$ is the function $f^*$ on $R^k$ defined by 

$$f^*(x^*) = \sup_x\,(x \cdot x^* - f(x)), \quad x^* \in R^k.$$

Clearly, for any convex function $f$ on $R^k$ 

$$x \cdot x^* \le f(x) + f^*(x^*), \quad x \in R^k,\ x^* \in R^k. \tag{1}$$

This inequality is called Fenchel's inequality. 

Theorem 5.18. Let $f$ be a convex function on $R^k$. The conjugate $f^*$ is then a closed 
convex function on $R^k$. Moreover, $(\operatorname{cl} f)^* = f^*$ and $f^{**} = \operatorname{cl} f$. Thus, in particular, 
$f = f^{**}$ if and only if $f$ is closed. 

For a proof see Rockafellar (1970), p. 104. ► 

More generally than the first assertion of Theorem 5.18 one has that for any $f$ 
on $R^k$ into $(-\infty, \infty]$ the function $f^*$ defined on $R^k$ by 

$$f^*(x^*) = \sup_x\,\{x \cdot x^* - f(x)\}$$

is closed convex (unless it is identically $\infty$) because it is the supremum of closed 
convex (in fact, affine) functions. Whether $f$ is convex or not, $f^*$ will be called the 
conjugate of $f$. Using Theorem 5.16 it is simple to see that 

$$f^* = (\operatorname{conv} f)^*. \tag{2}$$

Examples of conjugate pairs of closed convex functions are: 

Example 5.3. If 

$$f(x) = \frac{1}{p}\,|x|^p, \quad x \in R,$$

where $1 < p < \infty$, then 

$$f^*(x^*) = \frac{1}{q}\,|x^*|^q, \quad x^* \in R,$$

with $1/p + 1/q = 1$. ► 
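
The conjugate pair of Example 5.3 can be checked numerically. The following is a rough sketch, not part of the original text: it approximates the supremum in the definition of the conjugate by a maximum over a finite grid, with the exponent $p = 3$, the grid bounds and the helper name `conjugate_on_grid` all being assumptions made for illustration.

```python
def f(x, p):
    """f(x) = |x|^p / p, as in Example 5.3."""
    return abs(x) ** p / p

def conjugate_on_grid(func, x_star, lo=-10.0, hi=10.0, n=40001):
    """Approximate sup_x (x * x_star - func(x)) by a maximum over a grid.

    The grid maximum never exceeds the true supremum and is close to it
    as long as the maximising x lies well inside [lo, hi].
    """
    step = (hi - lo) / (n - 1)
    return max((lo + i * step) * x_star - func(lo + i * step) for i in range(n))

p = 3.0
q = p / (p - 1.0)  # conjugate exponent, so that 1/p + 1/q = 1

results = []
for x_star in (-2.0, -0.5, 0.0, 1.0, 2.5):
    approx = conjugate_on_grid(lambda x: f(x, p), x_star)
    exact = abs(x_star) ** q / q
    results.append((x_star, approx, exact))
    print(f"x* = {x_star:5.2f}   grid sup = {approx:.6f}   |x*|^q / q = {exact:.6f}")
```

For each value of $x^*$ the grid maximum and the closed form $|x^*|^q/q$ should agree up to the discretisation error of the grid.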

Example 5.4. Suppose $f$ is a positive semi-definite quadratic form on $R^k$, 

$$f(x) = \tfrac{1}{2}\, x c x', \quad x \in R^k,$$

$c$ being a symmetric, positive semi-definite $k \times k$ matrix. 

If $c$ is non-singular then 

$$f^*(x^*) = \tfrac{1}{2}\, x^* c^{-1} x^{*\prime}, \quad x^* \in R^k.$$

Thus, in particular, for the function $f_0(x) = \tfrac{1}{2}\, x \cdot x$, $x \in R^k$, we have $f_0^* = f_0$. This 
function is the only function on $R^k$ which is equal to its own conjugate. In fact, any 
such function satisfies 

$$x \cdot x \le f(x) + f^*(x) = 2f(x)$$

and hence $f \ge f_0$ which in turn implies $f^* \le f_0^*$. Since $f_0 = f_0^*$ and, by assumption, 
$f = f^*$ we must have $f = f_0$. 

In general, for $c$ arbitrary positive semi-definite 

$$f^*(x^*) = \begin{cases} \tfrac{1}{2}\, x^* \tilde{c}\, x^{*\prime} & x^* \in L \\ +\infty & x^* \notin L \end{cases}$$

where $L$ is the orthogonal complement of the subspace $\{x : xc = 0\}$ and where $\tilde{c}$ is 
the unique symmetric positive semi-definite $k \times k$ matrix satisfying $\tilde{c}c = c\tilde{c} = p$ 
with $p$ the matrix of the orthogonal projection of $R^k$ onto $L$. ► 

Example 5.5. Let $f$ be defined on $R^k$ by 

$$f(x) = \begin{cases} x_1 \ln x_1 + \cdots + x_k \ln x_k & \text{for } x \in A \\ +\infty & \text{for } x \notin A \end{cases}$$

where 

$$A = \{x : x = (x_1, \ldots, x_k),\ x_1 \ge 0, \ldots, x_k \ge 0,\ x_1 + \cdots + x_k = 1\}$$

and where $0 \ln 0$ is interpreted as 0. Then $f$ is closed convex and the conjugate of $f$ 
is given by 

$$f^*(x^*) = \ln(e^{x_1^*} + \cdots + e^{x_k^*}), \quad x^* \in R^k.$$

For a proof see Rockafellar (1970), pp. 148-149. ► 
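
The conjugate formula in Example 5.5 can be made plausible with a small numerical experiment. The sketch below is an illustration under assumptions made here: the test vector `x_star`, the random sampling of simplex points and all function names are hypothetical. It uses the standard fact that the normalised exponentials of $x^*$ (the "softmax" point) maximise $x \cdot x^* - f(x)$ over the simplex.

```python
import math
import random

def neg_entropy(x):
    """f(x) = sum of x_i ln x_i on the simplex, with 0 ln 0 read as 0."""
    return sum(xi * math.log(xi) for xi in x if xi > 0.0)

def log_sum_exp(x_star):
    """ln(exp(x_1*) + ... + exp(x_k*)), computed in a numerically stable way."""
    m = max(x_star)
    return m + math.log(sum(math.exp(t - m) for t in x_star))

def softmax(x_star):
    """The simplex point with coordinates proportional to exp(x_i*)."""
    m = max(x_star)
    w = [math.exp(t - m) for t in x_star]
    s = sum(w)
    return [wi / s for wi in w]

def objective(x, x_star):
    """x . x* - f(x), the quantity whose supremum over x defines f*(x*)."""
    return sum(xi * si for xi, si in zip(x, x_star)) - neg_entropy(x)

def random_simplex_point(k):
    """A random point of the simplex (normalised exponential variates)."""
    w = [random.expovariate(1.0) for _ in range(k)]
    s = sum(w)
    return [wi / s for wi in w]

random.seed(0)
x_star = [0.3, -1.2, 2.0, 0.0]  # arbitrary illustrative vector
target = log_sum_exp(x_star)
at_softmax = objective(softmax(x_star), x_star)
largest_random = max(
    objective(random_simplex_point(len(x_star)), x_star) for _ in range(2000)
)
print(target, at_softmax, largest_random)
```

The value at the softmax point reproduces $\ln(e^{x_1^*} + \cdots + e^{x_k^*})$, while randomly drawn simplex points never exceed it.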

The relationship between the effective domains of a convex function / and its 
conjugate depends heavily on the behaviour of / and consequently very little can 
be said in general about this relationship. There is however one simple and useful 
result in this area, noted by Fenchel (1953), p. 93. 


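Theorem 5.19. Let $f$ be any convex function on $R^k$. Then the barrier cone of $\operatorname{dom} f$ 
is contained in the recession cone of $\operatorname{dom} f^*$; in symbols 

$$\operatorname{bar}(\operatorname{dom} f) \subset 0^+(\operatorname{dom} f^*).$$

Proof. Suppose $y \in \operatorname{bar}(\operatorname{dom} f)$, $y \ne 0$ and $x^* \in \operatorname{dom} f^*$. We have to show that 
$\{x^* + \lambda y : \lambda \ge 0\} \subset \operatorname{dom} f^*$. Now, both $(y, 0)$ and $(x^*, -1)$ belong to the barrier 
cone of $\operatorname{epi} f$ and since barrier cones are convex 

$$(x^*, -1) + \lambda(y, 0) = (x^* + \lambda y, -1) \in \operatorname{bar}(\operatorname{epi} f), \quad \lambda \ge 0$$

which implies $x^* + \lambda y \in \operatorname{dom} f^*$ for every $\lambda \ge 0$. ► 

That the inclusion in Theorem 5.19 may be strict can be seen e.g. by taking 
$f(x) = \tfrac{1}{2}\, x \cdot x$, $x \in R^k$. In this instance $\operatorname{bar}(\operatorname{dom} f) = \{0\}$ while $0^+(\operatorname{dom} f^*) = R^k$. 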


Theorem 5.20. Let $f$ be a closed convex function on $R^k$. 

In order that $\{x : f(x) \le \alpha\}$ be a bounded (as well as closed, convex) set in $R^k$ for all 
$\alpha \in R$ it is necessary and sufficient that $0 \in \operatorname{int}(\operatorname{dom} f^*)$. In this case the infimum of $f$ 
is attained. 


For a proof see Rockafellar (1970), p. 123 and p. 265. ► 

The indicator function of a convex set $C$ in $R^k$ is defined by 

$$\delta(x \mid C) = \begin{cases} 0 & \text{for } x \in C \\ \infty & \text{for } x \notin C. \end{cases}$$

The function $\delta(\cdot \mid C)$ is obviously convex, and it is closed if and only if $C$ is closed. 
Its conjugate $\delta^*(\cdot \mid C)$ is equal to the support function of $C$, i.e. 

$$\delta^*(x^* \mid C) = \sup\{x \cdot x^* : x \in C\}, \quad x^* \in R^k.$$

Theorem 5.21. Let $f$ be a convex function. The support function of $\operatorname{dom} f$ is then the 
recession function $f^*0^+$ of $f^*$. If $f$ is closed, the support function of $\operatorname{dom} f^*$ is the 
recession function $f0^+$ of $f$. 

For a proof see Rockafellar (1970), p. 116. ► 

Let $f_1, \ldots, f_m$ be convex functions on $R^k$. The infimal convolution of $f_1, \ldots, f_m$ is 
defined on $R^k$ by 

$$(f_1 \,\square\, f_2 \,\square \cdots \square\, f_m)(x) = \inf\{f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m) : x_1 + x_2 + \cdots + x_m = x\}$$

and is again a convex function, provided it does not take the value $-\infty$ for any 
$x \in R^k$. The operation of infimal convolution is, essentially, dual to that of 
addition of convex functions. In fact 

$$(f_1 \,\square \cdots \square\, f_m)^* = f_1^* + \cdots + f_m^*$$

and if the sets $\operatorname{ri}(\operatorname{dom} f_i)$, $i = 1, \ldots, m$, have a point in common then 

$$(f_1 + \cdots + f_m)^* = f_1^* \,\square \cdots \square\, f_m^*, \tag{3}$$

cf. Rockafellar (1970), p. 145. 
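
As a small worked instance of this duality, take two quadratics on $R$, $f_1(x) = \tfrac{1}{2} a x^2$ and $f_2(x) = \tfrac{1}{2} b x^2$ with $a, b > 0$ (an illustrative choice, not from the text). By Example 5.4 their conjugates are $x^{*2}/(2a)$ and $x^{*2}/(2b)$; the sum of these is again a quadratic, whose conjugate is $\tfrac{1}{2}\,\frac{ab}{a+b}\,x^2$, so this should be the infimal convolution of $f_1$ and $f_2$. The Python sketch below checks that prediction on a grid; the helper `inf_convolution` and the grid parameters are assumptions made for the illustration.

```python
def f1(x, a=1.0):
    """f1(x) = a x^2 / 2."""
    return 0.5 * a * x * x

def f2(x, b=3.0):
    """f2(x) = b x^2 / 2."""
    return 0.5 * b * x * x

def inf_convolution(g1, g2, x, lo=-10.0, hi=10.0, n=20001):
    """Approximate inf{g1(x1) + g2(x2) : x1 + x2 = x} over a grid of x1 values."""
    step = (hi - lo) / (n - 1)
    return min(g1(lo + i * step) + g2(x - (lo + i * step)) for i in range(n))

a, b = 1.0, 3.0
coefficient = a * b / (a + b)  # predicted coefficient of the infimal convolution

checks = []
for x in (-2.0, 0.5, 3.0):
    approx = inf_convolution(f1, f2, x)
    exact = 0.5 * coefficient * x * x
    checks.append((x, approx, exact))
    print(f"x = {x:5.2f}   grid inf = {approx:.6f}   predicted = {exact:.6f}")
```

The grid infimum is never below the predicted value and differs from it only by the discretisation error.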

Consider a convex function $f$ on $R^{l+m}$ and denote its argument by $(u, v)$ where 
$u \in R^l$, $v \in R^m$. The function 

$$h(u, v^*) = \sup_v\,\{v \cdot v^* - f(u, v)\}, \quad u \in R^l,\ v^* \in R^m$$

is called a partial conjugate of $f$. For each fixed $u$ it is either a closed convex 
function of $v^*$ or identically $-\infty$. And for each fixed $v^*$ it is, on account of 
Theorem 5.10, either concave or, if it takes the value $+\infty$, concave in the extended 
sense indicated by that theorem. ($h$ is a so-called saddle function, cf. Rockafellar 
(1970), p. 349.) The set of points $(u, v^*)$ for which $h(u, v^*)$ is finite is denoted $\operatorname{dom} h$. 
Set 

$$D = \{u : (u, v) \in \operatorname{dom} f \text{ for some } v\}$$
$$E = \{v^* : (u^*, v^*) \in \operatorname{dom} f^* \text{ for some } u^*\}.$$

It will now be shown that 

$$\operatorname{ri} D \times \operatorname{ri} E \subset \operatorname{dom} h \subset D \times \operatorname{cl} E \tag{4}$$

provided $f$ is closed. 

Note first that the functions $f_u(\cdot) = f(u, \cdot)$, $u \in D$, are closed convex functions on 
$R^m$ which all have the same recession function. Hence, by Theorem 5.21, the 
effective domains of the conjugates of these functions must be equal, modulo 
relative boundary points. Moreover, for $u \in \operatorname{ri} D$ one has 

$$\operatorname{dom}(f_u)^* = E. \tag{5}$$

To see this, observe that $v^* \in \operatorname{dom}(f_u)^*$ if and only if $(u^*, v^*) \in \operatorname{dom}(f + \delta(\cdot \mid u \times R^m))^*$ 
and that 

$$(f + \delta(\cdot \mid u \times R^m))^*(u^*, v^*) = (f^* \,\square\, \delta^*(\cdot \mid u \times R^m))(u^*, v^*) = \inf_{w^*}\{f^*(w^*, v^*) - u \cdot w^*\} + u \cdot u^*.$$

It is now simple to verify (4). 

In the case where $D$ and $E$ are both open then 

$$\operatorname{dom} h = D \times E. \tag{6}$$


5.4 DIFFERENTIAL THEORY 

Let $f$ be a convex function on $R^k$ and let $x^*$ be a vector in $R^k$. Then $x^*$ is said to be a 
subgradient of $f$ at a point $x$ if 

$$f(z) \ge f(x) + (z - x) \cdot x^* \quad \forall z \in R^k. \tag{1}$$

The set of all subgradients of $f$ at $x$ is called the subdifferential of $f$ at $x$ and is 
denoted by $\partial f(x)$. The (possibly) multivalued mapping $\partial f : x \to \partial f(x)$ is the 
subdifferential of $f$ and $f$ is said to be subdifferentiable at $x$ provided $\partial f(x) \ne \emptyset$. 
Obviously 

$$f^*(x^*) = x \cdot x^* - f(x) \iff x^* \in \partial f(x). \tag{2}$$

In other words: for each $x^* \in R^k$ the concave function $l(\cdot\,; x^*)$ on $R^k$ defined by 
$l(x; x^*) = x \cdot x^* - f(x)$, $x \in R^k$, attains its supremum at $x$ if and only if $x^* \in \partial f(x)$. If 
$f$ is closed then the latter condition is equivalent to $x \in \partial f^*(x^*)$. Consequently, the 
mappings $\partial f$ and $\partial f^*$ are each other's inverse provided $f$ is closed. 

The domain of $\partial f$, i.e. the set $\{x : \partial f(x) \ne \emptyset\}$, will be denoted by $\operatorname{dom} \partial f$. 

Example 5.6. Consider the indicator function $\delta(\cdot \mid C)$ of a convex set $C$ in $R^k$. It is 
obvious from the defining relation (1) that $\partial\delta(x \mid C) = \emptyset$ if $x \notin C$, while for $x \in C$ the 
vector $x^*$ is a subgradient of $\delta(\cdot \mid C)$ at $x$ if and only if $0 \ge (z - x) \cdot x^*$ for every 
$z \in C$, i.e. $x^*$ is a normal to $C$ at $x$. Thus 

$$\partial\delta(\cdot \mid C) = K$$

where $K$ denotes the normal cone mapping on $C$. ► 

Theorem 5.22. Let $f$ be a convex function on $R^k$. Then 

$$\operatorname{ri}(\operatorname{dom} f) \subset \operatorname{dom} \partial f \subset \operatorname{dom} f.$$

A point $x$ in the relative boundary of $\operatorname{dom} f$ is not contained in $\operatorname{dom} \partial f$ if and only if 

$$\frac{f(x + \lambda(x_0 - x)) - f(x)}{\lambda} \to -\infty \quad \text{as } \lambda \downarrow 0$$

for one, and hence every, point $x_0 \in \operatorname{ri}(\operatorname{dom} f)$. 

The range of $\partial f$ is contained in $\operatorname{dom} f^*$. 

The proof of this theorem is easy to deduce by means of Theorems 23.4, 23.2, 
and 23.3 on pp. 216-217 in Rockafellar (1970). ► 

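Theorem 5.23. Let $f$ be a function on $R^k$ into $(-\infty, \infty]$ for which $\operatorname{dom} f$ is finite. 
Set $\bar{f} = \operatorname{conv} f$. 

Then $\operatorname{dom} \partial\bar{f} = \operatorname{dom} \bar{f}$. 

Proof. Set $S = \operatorname{dom} f$ and let $S'$ consist of the points $(x, f(x))$, $x \in S$, and the 
direction $(0, 1)$ (where $0 \in R^k$, $1 \in R$). From Theorem 5.17 we have $\operatorname{conv} S' = \operatorname{epi} \bar{f}$. 
Consider a point $x \in \operatorname{dom} \bar{f}$ and set 

$$M = \{(x, \eta) : \eta < \bar{f}(x)\}.$$

Then $\operatorname{conv} S' \cap \operatorname{ri} M = \emptyset$ and hence, by Rockafellar (1970) Theorem 20.2, there 
exists a hyperplane separating $\operatorname{conv} S'$ and $M$ properly and not containing $M$. Such 
a hyperplane is obviously a nonvertical supporting hyperplane to $\operatorname{epi} \bar{f}$ at 
$(x, \bar{f}(x))$ and consequently $\partial\bar{f}(x) \ne \emptyset$. ► 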

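Theorem 5.24. Let $f_1, \ldots, f_m$ be convex functions on $R^k$ and $f = f_1 + \cdots + f_m$. 
Then 

$$\partial f(x) \supset \partial f_1(x) + \cdots + \partial f_m(x) \quad \forall x \in R^k.$$

If the convex sets $\operatorname{ri}(\operatorname{dom} f_i)$, $i = 1, 2, \ldots, m$, have a point in common, then actually 

$$\partial f(x) = \partial f_1(x) + \cdots + \partial f_m(x) \quad \forall x \in R^k.$$

For a proof see Rockafellar (1970), p. 223. ► 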

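Corollary 5.2. Suppose $f_1, \ldots, f_m$ are closed convex functions on $R^k$ such that the sets 
$\operatorname{ri}(\operatorname{dom} f_i^*)$, $i = 1, \ldots, m$, have a point in common. 

If $x_1, \ldots, x_m$ are points in $R^k$ for which 

$$\partial f_1(x_1) \cap \ldots \cap \partial f_m(x_m) \ne \emptyset$$

then 

$$(f_1 \,\square \cdots \square\, f_m)(x_\cdot) = f_1(x_1) + \cdots + f_m(x_m)$$

(where $x_\cdot = x_1 + \cdots + x_m$). 

Proof. Let $x^* \in \partial f_1(x_1) \cap \ldots \cap \partial f_m(x_m)$, then 

$$x_\cdot \in \partial f_1^*(x^*) + \cdots + \partial f_m^*(x^*) = \partial(f_1^* + \cdots + f_m^*)(x^*)$$

and hence, by (3) of Section 5.3 and (2) above, 

$$(f_1 \,\square \cdots \square\, f_m)(x_\cdot) = (f_1^* + \cdots + f_m^*)^*(x_\cdot)$$
$$= x_\cdot \cdot x^* - (f_1^* + \cdots + f_m^*)(x^*)$$
$$= f_1(x_1) + \cdots + f_m(x_m). \quad ►$$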

The following theorem describes the relationship between the concepts of 
subgradient and gradient of a convex function. 

Theorem 5.25. Let $f$ be a convex function on $R^k$ and let $x$ be a point in $R^k$. Then $\partial f(x)$ 
contains exactly one element if and only if $f$ is finite and differentiable at $x$ (which 
implies $x \in \operatorname{int}(\operatorname{dom} f)$). In this case the element is $Df(x)$. 

For a proof see Rockafellar (1970), p. 242. ► 

The entire subdifferential mapping $\partial f$ can, in fact, be constructed from the 
gradient mapping $Df$ when $f$ is a closed convex function with $\operatorname{int}(\operatorname{dom} f) \ne \emptyset$. 
Specifically: 

Theorem 5.26. Let $f$ be a closed convex function with $\operatorname{int}(\operatorname{dom} f) \ne \emptyset$. Then 

$$\partial f(x) = \operatorname{cl}(\operatorname{conv} M(x)) + K(x) \quad \forall x \in R^k$$

where $M(x)$ is the set of all limits of sequences of the form $Df(x_1), Df(x_2), \ldots$ such 
that $f$ is differentiable at $x_i$ and $x_i$ tends to $x$ and where $K(x)$ is the normal cone to 
$\operatorname{dom} f$ at $x$. 

For a proof see Rockafellar (1970), p. 246. ► 

Consider a convex function $f$ on $R^k$ for which $\operatorname{int}(\operatorname{dom} f) \ne \emptyset$ and $f$ is 
differentiable throughout $\operatorname{int}(\operatorname{dom} f)$. Such a function will be said to be steep at $x$, 
where $x$ is a boundary point of $\operatorname{dom} f$, if 

$$|Df(x_i)| \to \infty$$

whenever $x_1, x_2, \ldots$ is a sequence of points in $\operatorname{int}(\operatorname{dom} f)$ converging to $x$. 
Furthermore, $f$ will be called steep if it is steep at all boundary points of $\operatorname{dom} f$. 

Theorem 5.27. Let $f$ be a convex function on $R^k$ such that $\operatorname{int}(\operatorname{dom} f) \ne \emptyset$ and $f$ is 
differentiable throughout $\operatorname{int}(\operatorname{dom} f)$. 

Then $f$ is steep if and only if 

$$\frac{d}{d\lambda} f(x + \lambda(z - x)) \downarrow -\infty \quad \text{as } \lambda \downarrow 0$$

for any $z \in \operatorname{int}(\operatorname{dom} f)$ and any boundary point $x$ of $\operatorname{dom} f$. 

For a proof see Rockafellar (1970), p. 252. ► 

Corollary 5.3. Let $f$ be a closed convex function on $R^k$ such that $\operatorname{dom} f$ is open and $f$ 
is differentiable (on $\operatorname{dom} f$). 

Then $f$ is steep. 

Proof. Use Corollary 5.1. 

A convex function $f$ on $R^k$ is essentially smooth if $\operatorname{int}(\operatorname{dom} f) \ne \emptyset$, $f$ is 
differentiable on $\operatorname{int}(\operatorname{dom} f)$ and steep. 

Theorem 5.28. Let $f$ be a closed convex function on $R^k$. Then $\partial f$ is a single-valued 
mapping (i.e. for every $x$, $\partial f(x)$ contains at most one element) if and only if $f$ is 
essentially smooth. In this case, $\partial f$ reduces to the gradient mapping $Df$, i.e. $\partial f(x)$ 
consists of the vector $Df(x)$ alone when $x \in \operatorname{int}(\operatorname{dom} f)$ while $\partial f(x) = \emptyset$ when 
$x \notin \operatorname{int}(\operatorname{dom} f)$. 

For a proof see Rockafellar (1970), p. 252. ► 

A convex function $f$ on $R^k$ will be called essentially strictly convex if $f$ is strictly 
convex on every convex subset of $\operatorname{dom} \partial f$. 

Theorem 5.29. Let $f$ be a closed convex function. Then $f$ is essentially strictly 
convex if and only if $\partial f(x_1) \cap \partial f(x_2) = \emptyset$ whenever $x_1 \ne x_2$. 

For a proof see Rockafellar (1970), p. 254. ► 

Theorem 5.30. A closed convex function on R^ is essentially strictly convex if and 
only if its conjugate is essentially smooth. 

For a proof see Rockafellar (1970), p. 254. ► 

The concept of conjugacy for convex functions is closely related to the classical 
concept of Legendre transformation. We shall now describe this correspondence. 
Let $f$ be a differentiable real-valued function defined on an open subset $U$ of $R^k$. 
The Legendre transform of the pair $(U, f)$ is defined to be the pair $(V, g)$ where 
$V = Df(U)$ is the range of the gradient mapping $Df$ and $g$ is the function on $V$ 
given by the formula 

$$g(x^*) = x \cdot x^* - f(x)$$

where $x$ satisfies 

$$x^* = Df(x).$$

(The gradient mapping $Df$ does not have to be one-to-one in order for $g$ to be well 
defined, i.e. single-valued. For this it suffices that 

$$x_1 \cdot x^* - f(x_1) = x_2 \cdot x^* - f(x_2),$$

whenever $Df(x_1) = Df(x_2) = x^*$. Then the value of $g(x^*)$ can be obtained 
unambiguously from the formula by replacing $(Df)^{-1}(x^*)$ by any of the vectors it 
contains.) 

The classical areas of application of the Legendre transformation lie within the 
theory of differential equations (cf. Kamke 1930, 1974) and the calculus of 
variations (cf. Courant and Hilbert 1953, pp. 231-242). 

If $U$ and $f$ are both convex, $f$ can be extended (in a unique manner) to a closed 
convex function on all of $R^k$ with $U$ as the interior of its effective domain. There is 
then the following relation between the Legendre transform of $(U, f)$ and the 
conjugate of the extended $f$. 

Theorem 5.31. Let $f$ be any convex function on $R^k$ such that the set $U = \operatorname{int}(\operatorname{dom} f)$ 
is non-empty and $f$ is differentiable on $U$. The Legendre transform $(V, g)$ of $(U, f)$ is 
then well-defined. Moreover, $V\ (= Df(U))$ is a subset of $\operatorname{dom} f^*$ and $g$ is the 
restriction of $f^*$ to $V$. 

For a proof see Rockafellar (1970), p. 256. ► 
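
A one-dimensional illustration of the relation between the Legendre transform and the conjugate may be helpful. The sketch below is an example chosen here, not taken from the text: for $f(x) = e^x$ on $U = R$ the gradient mapping is $Df(x) = e^x$ with range $V = (0, \infty)$, and solving $x^* = e^x$ gives $g(x^*) = x^* \ln x^* - x^*$. The code compares this with a grid approximation of the supremum defining the conjugate; the function names and grid bounds are hypothetical.

```python
import math

def f(x):
    """f(x) = exp(x), differentiable and convex on U = R."""
    return math.exp(x)

def legendre_g(x_star):
    """g(x*) = x . x* - f(x), where x solves x* = Df(x) = exp(x)."""
    x = math.log(x_star)  # requires x* > 0, i.e. x* in V = Df(U)
    return x * x_star - f(x)

def conjugate_on_grid(x_star, lo=-15.0, hi=5.0, n=40001):
    """Approximate f*(x*) = sup_x (x * x* - f(x)) by a maximum over a grid."""
    step = (hi - lo) / (n - 1)
    return max((lo + i * step) * x_star - f(lo + i * step) for i in range(n))

comparison = []
for x_star in (0.1, 1.0, 2.5, 10.0):
    g_value = legendre_g(x_star)
    conj_value = conjugate_on_grid(x_star)
    comparison.append((x_star, g_value, conj_value))
    print(f"x* = {x_star:5.2f}   g(x*) = {g_value:.6f}   grid conjugate = {conj_value:.6f}")
```

On the points of $V$ tried here the two columns agree up to the grid error, in line with $g$ being the restriction of the conjugate to $V$.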

On adding the assumption that $f$ is essentially smooth one obtains the 
following sharpening of Theorem 5.31: 

Theorem 5.32. Let $f$ be any essentially smooth closed convex function on $R^k$ and let 
$U = \operatorname{int}(\operatorname{dom} f)$. Then the Legendre transform $(V, g)$ of $(U, f)$ is well-defined. 
One has $V = \operatorname{dom} \partial f^*$ so that $V$ is almost convex in the sense that 
$\operatorname{ri}(\operatorname{dom} f^*) \subset V \subset \operatorname{dom} f^*$. Furthermore $g$ is the restriction of $f^*$ to $V$ and $g$ is 
strictly convex on every convex subset of $V$. 

For a proof see Rockafellar (1970), p. 257. ► 

The condition of steepness in the definition of essential smoothness is needed 
for the conclusion (in Theorem 5.32) that $V$ is almost convex. Suppose, for 
example, that $f$ is the closed convex function on $R^2$ determined by 

$$f(x) = \frac{x_1^2}{4x_2} \quad \text{for } x_1 \in R,\ x_2 > 0.$$

The steepness condition fails at the origin $(0, 0)$ and $V$ is a parabola 

$$V = \{x^* : x_2^* = -(x_1^*)^2\}.$$

This example is due to Rockafellar. 

A pair $(C, f)$ is said to be of Legendre type if $C$ is an open convex set and $f$ is a 
strictly convex and differentiable function on $C$ such that 

$$|Df(x_i)| \to \infty$$

whenever $x_1, x_2, \ldots$ is a sequence of points in $C$ converging to a boundary point 
of $C$. 

Clearly, if a convex function $f$ on $R^k$ is essentially strictly convex and essentially 
smooth then $(\operatorname{int}(\operatorname{dom} f), f)$ is of Legendre type. 

Theorem 5.33. Let $f$ be a closed convex function. Let $C = \operatorname{int}(\operatorname{dom} f)$ and 
$C^* = \operatorname{int}(\operatorname{dom} f^*)$. Then $(C, f)$ is a convex function of Legendre type if and only if 
$(C^*, f^*)$ is a convex function of Legendre type. In this case $(C^*, f^*)$ is the Legendre 
transform of $(C, f)$ and vice versa. The gradient mapping $Df$ is then one-to-one from 
the open convex set $C$ onto the open convex set $C^*$, continuous in both directions, 
and $Df^* = (Df)^{-1}$. 

For a proof see Rockafellar (1970), p. 258. ► 

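Let $f$ be a closed convex function on $R^{l+m}$, recall the notation used at the end of 
Section 5.3 in relation to partial conjugation, and set 

$$M = \{(u, \partial f_u(v)) : (u, v) \in \operatorname{dom} \partial f\}.$$

Suppose $u \in \operatorname{ri} D$. Then a vector $v^*$ belongs to $\partial f_u(v)$ if and only if $(w^*, v^*)$ belongs 
to $\partial(f + \delta(\cdot \mid u \times R^m))(u, v)$ for every $w^* \in R^l$, as follows directly from the definition 
of subgradient. But, by Theorem 5.24 

$$\partial(f + \delta(\cdot \mid u \times R^m)) = \partial f + \partial\delta(\cdot \mid u \times R^m)$$

and hence, for every $v \in R^m$, 

$$\partial(f + \delta(\cdot \mid u \times R^m))(u, v) = \partial f(u, v) + R^l \times \{0\}.$$

Consequently, for $u \in \operatorname{ri} D$, $\partial f_u(v)$ equals the projection of $\partial f(u, v)$ onto $\{0\} \times R^m$ 
(interpreted as a subset of $R^m$). 

From this and the proof of (4) in Section 5.3 it is simple to see that 

$$\operatorname{ri} D \times \operatorname{ri} E \subset M \subset D \times \operatorname{cl} E$$

and that if $f$ is of Legendre type then 

$$M = \operatorname{int} D \times \operatorname{int} E. \tag{3}$$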


Theorem 5.34. Let $f$ be a closed convex function with $\operatorname{dom} f$ open and suppose $f$ is 
strictly convex and differentiable on $\operatorname{dom} f$. For $x \in \operatorname{dom} f$ denote $Df(x)$ by 
$d\ (= d(x))$, and let $x = (x^{(1)}, x^{(2)})$ and $d = (d^{(1)}, d^{(2)})$ be similar partitions of $x$ and $d$. 

Then the mapping defined on $\operatorname{dom} f$ by $x \to (x^{(1)}, d^{(2)})$ is a homeomorphism, and 
$x^{(1)}$ and $d^{(2)}$ are variation independent. 

Proof. Using Corollary 5.3 one finds that $f$ is of Legendre type, and the theorem 
then follows from Theorem 5.33 and formula (3). 

The idea of the proof of this theorem is due to Rockafellar. ► 


5.5 COMPLEMENTS 

(i) A convex set is called polyhedral if it is the intersection of a finite collection of 
closed halfspaces. 

Theorem 5.35. Let $f$ be a closed convex function on $R^k$, and let $K$ be a non-empty 
closed convex cone in $R^k$. Let $K^*$ be the negative of the polar of $K$, i.e. 

$$K^* = \{x^* : x \cdot x^* \ge 0 \ \ \forall x \in K\}.$$

One has 

$$\inf\{f(x) : x \in K\} = -\inf\{f^*(x^*) : x^* \in K^*\}$$

if either of the following conditions holds: 

(a) $\operatorname{ri}(\operatorname{dom} f) \cap \operatorname{ri} K \ne \emptyset$; 

(b) $\operatorname{ri}(\operatorname{dom} f^*) \cap \operatorname{ri} K^* \ne \emptyset$. 

Under (a), the infimum of $f^*$ over $K^*$ is attained, while under (b) the infimum of $f$ 
over $K$ is attained. 

If $K$ is polyhedral, $\operatorname{ri} K$ and $\operatorname{ri} K^*$ can be replaced by $K$ and $K^*$ in (a) and (b). 
In general, $x$ and $x^*$ satisfy 

$$f(x) = \inf_K f = -\inf_{K^*} f^* = -f^*(x^*)$$

if and only if 

$$x^* \in \partial f(x),\quad x \in K,\quad x^* \in K^*,\quad x \cdot x^* = 0.$$

For a proof see Rockafellar (1970), p. 335. ► 

(ii) Convex support. Let $\pi$ denote a probability measure on $R^k$. A point $x \in R^k$ is 
said to be a point of support for $\pi$ provided every neighbourhood of $x$ has positive 
$\pi$-measure. It follows immediately from Lindelöf's covering theorem that every 
Borel set with positive $\pi$-measure contains a point of support. Let $S = S_\pi$ denote 
the set of support points for $\pi$. $S$ is called the support of $\pi$. It is a closed set and may 
be characterized as the smallest closed set having $\pi$-measure 1, i.e. the intersection 
of all closed sets with measure 1. The convex support $C = C_\pi$ of $\pi$ is the closed 
convex hull of $S$. It is the smallest closed convex set with measure 1. (A more 
precise name for $C$ would be 'closed convex support' but since the convex hull of $S$ 
does not, in itself, play any prominent part in what follows, the shorter 'convex 
support' is adopted here.) Finally, $\operatorname{aff} S$ is called the affine support of $\pi$. 

Let $h$ denote a continuous mapping on $R^k$ into $R^m$. Then $h(S_\pi) \subset S_{\pi h}$ and $h(S_\pi)$ is 
dense in $S_{\pi h}$. If $h$ is a homeomorphism onto $R^m$ then $h(S_\pi) = S_{\pi h}$. 


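Theorem 5.36. If $a$ is an affine mapping on $R^k$ into $R^m$ then $\operatorname{cl} a(C_\pi) = C_{\pi a}$ and 
$a(\operatorname{ri} C_\pi) = \operatorname{ri} C_{\pi a}$. 

Proof. If $M_0$ and $M$ are arbitrary subsets of $R^k$ with $M_0 \subset M$ and $M_0$ dense in $M$ 
then $\operatorname{cl} \operatorname{conv} M_0 = \operatorname{cl} \operatorname{conv} M$. Hence, by Theorem 5.9, 

$$\operatorname{cl} a(C_\pi) = \operatorname{cl} a(\operatorname{cl} \operatorname{conv} S_\pi) = \operatorname{cl} \operatorname{conv} a(S_\pi) = \operatorname{cl} \operatorname{conv} S_{\pi a} = C_{\pi a}$$

and 

$$a(\operatorname{ri} C_\pi) = a(\operatorname{ri} \operatorname{conv} S_\pi) = \operatorname{ri} \operatorname{conv} a(S_\pi) = \operatorname{ri} \operatorname{cl} \operatorname{conv} a(S_\pi) = \operatorname{ri} C_{\pi a}. \quad ►$$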

Let $X$ denote the identity mapping on $R^k$. If the mean value of $X$ with respect to 
$\pi$ exists then $E_\pi X \in C$. 

Suppose that $X_1, X_2, \ldots, X_n, \ldots$ is a sequence of independent $k$-dimensional 
random vectors with common marginal distribution $\pi$. Let $S_n$ be the support of 
the distribution $\pi_n$ of 

$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}.$$

Then 

$$S_n = \operatorname{cl}\{x : x = (x_1 + \cdots + x_n)/n,\ x_i \in S \text{ for } i = 1, \ldots, n\} \tag{1}$$

and the convex support of $\pi_n$ is equal to $C$ for all $n$, i.e. 

$$\operatorname{cl} \operatorname{conv} S_n = C. \tag{2}$$

Moreover $S_n$ is asymptotically dense in $\operatorname{conv} S$ in the sense that to every 
$x_0 \in \operatorname{conv} S$ and every neighbourhood $U$ of $x_0$ there exists an $n_0$ such that 

$$S_n \cap U \ne \emptyset \quad \forall n \ge n_0.$$

(iii) Jensen's inequality. Let $\pi$ be a probability measure on $R^k$ with convex support 
$C$ and suppose that its mean value 

$$\mu = \int x \, d\pi$$

exists. Furthermore, let $f$ be a convex function on $R^k$ such that $C \subset \operatorname{dom} f$ and 
$\mu \in \operatorname{dom} \partial f$. Then Jensen's inequality 

$$f(\mu) \le \int f \, d\pi$$

is valid. 

This may be seen by taking a $\mu^* \in \partial f(\mu)$, noting that 

$$f(x) \ge f(\mu) + (x - \mu) \cdot \mu^* \quad \forall x \in R^k,$$

and integrating with respect to $\pi$. 

Suppose equality holds in Jensen's inequality. Then on a convex set having 
probability one $f$ coincides with an affine function. And if $f$ is strictly convex on $C$ 
then $\pi\{\mu\} = 1$. 
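
The subgradient argument for Jensen's inequality can be followed on a small discrete example. In the sketch below the four support points, their probabilities and the choice $f = \exp$ are assumptions made purely for illustration; since $\exp$ is differentiable, its only subgradient at the mean is the derivative there.

```python
import math

# A discrete probability measure on R (illustrative values).
points = [-1.0, 0.0, 2.0, 3.0]
weights = [0.1, 0.4, 0.3, 0.2]

def expectation(values, probs):
    """Integral of a function with respect to the discrete measure."""
    return sum(v * p for v, p in zip(values, probs))

f = math.exp                      # a strictly convex function on R
mu = expectation(points, weights) # the mean value of the measure

# Supporting-line inequality f(x) >= f(mu) + (x - mu) * slope at every
# support point, with slope = f'(mu) = exp(mu).
slope = math.exp(mu)
support_ok = all(f(x) >= f(mu) + (x - mu) * slope for x in points)

lhs = f(mu)
rhs = expectation([f(x) for x in points], weights)
print(mu, lhs, rhs, support_ok)

# Degenerate measure concentrated at mu: equality holds.
lhs_degenerate = f(mu)
rhs_degenerate = expectation([f(mu)], [1.0])
```

Integrating the supporting-line inequality against the measure removes the linear term and leaves $f(\mu) \le \int f \, d\pi$; for the strictly convex $f$ used here the inequality is strict unless the measure is concentrated at $\mu$.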




CHAPTER 6 

Log-concavity and Unimodality 


6.1 LOG-CONCAVITY 

A function $g$ on $R^k$ and into $[0, \infty)$ is called logarithmically concave or log-concave 
if $\ln g$ is concave. Otherwise expressed, $g$ is log-concave if $f = -\ln g$ is convex. 

Theorems 6.1 and 6.2 below (which, except for a sharpening of Theorem 6.2, 
have previously been presented in Barndorff-Nielsen (1973b)) give various 
criteria for summability of log-concave functions. These are formulated in terms 
of $f$ rather than $g$. 

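Theorem 6.1. Let $f$ be a closed convex function on $R^k$ with $\operatorname{int} \operatorname{dom} f \ne \emptyset$. The 
following four conditions are equivalent. 

(i) $\int e^{-f} \, d\lambda < \infty$ 

(where $\lambda$ denotes Lebesgue measure on $R^k$). 

(ii) There exists an $x \in \operatorname{dom} f$ such that 

$$\liminf_{\lambda \to \infty} f(x + \lambda y) = \infty$$

for every $y \in R^k$ with $y \ne 0$. 

(iii) There exist two scalars $\lambda$ and $a$ such that $\lambda > 0$ and 

$$f(x) \ge \lambda|x| + a, \quad x \in R^k.$$

(iv) $0 \in \operatorname{int} \operatorname{dom} f^*$. 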

Proof. By Theorem 5.15, condition (ii) above is equivalent to the condition 

(ii)' $(f0^+)(y) > 0$, $y \in R^k$, $y \ne 0$. 

Condition (ii)', in turn, is equivalent to (iv) on account of Theorem 5.21. 

Next it will be shown that (iv) implies (iii). $f^*$ is continuous on $\operatorname{int} \operatorname{dom} f^*$ and 
hence, if $0 \in \operatorname{int} \operatorname{dom} f^*$, there exist a $\lambda > 0$ and an $a$ such that 

$$f^*(\lambda e(x)) \le a, \quad x \ne 0$$

where $x \in R^k$ and $e(x)$ denotes the unit vector in the direction determined by $x$. 
From the definition of $f^*$ it follows that 

$$\lambda e(x) \cdot x - f(x) \le f^*(\lambda e(x)), \quad x \ne 0.$$

Consequently 

$$f(x) \ge \lambda|x| - a, \quad x \ne 0$$

and thus (iii) holds. 

It is simple to demonstrate that if (iii) is valid then so is (i). 

Thus it only remains to prove that (i) entails (ii). Suppose (ii) is not fulfilled, then 
there exist an $x \in \operatorname{dom} f$ and a $y \ne 0$ for which 

$$\liminf_{\lambda \to \infty} f(x + \lambda y) < \infty$$

and this implies, by Theorem 5.15, that $f(x' + \lambda y)$ is a non-increasing function of 
$\lambda$, $-\infty < \lambda < \infty$, for every $x' \in R^k$. Clearly, then 

$$\int e^{-f} \, d\lambda = \infty.$$

► 
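
Two one-dimensional functions make the equivalences of Theorem 6.1 tangible. The examples are chosen here for illustration and are not from the text: $f(x) = |x|$ satisfies (iii) with $\lambda = 1$, $a = 0$, and its conjugate is the indicator function of $[-1, 1]$, so 0 is interior to $\operatorname{dom} f^*$; by contrast $f(x) = \max(x, 0)$ is constant along the direction $y = -1$, and its conjugate is the indicator function of $[0, 1]$, so 0 lies on the boundary of $\operatorname{dom} f^*$. The Python sketch below approximates the integral of $e^{-f}$ over growing intervals $[-T, T]$ by the midpoint rule; the helper name and step count are assumptions.

```python
import math

def integral_exp_minus(f, T, n=200000):
    """Midpoint-rule approximation of the integral of exp(-f) over [-T, T]."""
    h = 2.0 * T / n
    return h * sum(math.exp(-f(-T + (i + 0.5) * h)) for i in range(n))

def f_abs(x):
    """f(x) = |x|: grows linearly in every direction."""
    return abs(x)

def f_hinge(x):
    """f(x) = max(x, 0): does not grow in the direction y = -1."""
    return max(x, 0.0)

table = []
for T in (10.0, 100.0, 1000.0):
    row = (T, integral_exp_minus(f_abs, T), integral_exp_minus(f_hinge, T))
    table.append(row)
    print(f"T = {T:7.1f}   |x|: {row[1]:.6f}   max(x, 0): {row[2]:.3f}")
```

For $|x|$ the truncated integrals settle near 2, while for $\max(x, 0)$ they grow roughly like $T$, matching the dichotomy described by conditions (i) to (iv).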


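Theorem 6.2. Let $S$ be a subset of $Z^k$ and let $f$ be a closed convex function on $R^k$ with 
$\operatorname{int} \operatorname{dom} f \ne \emptyset$. 

If 

$$\int e^{-f} \, d\lambda < \infty \tag{1}$$

then 

$$\sum_{i \in S} e^{-f(i)} < \infty.$$

The converse assertion holds provided $\emptyset \ne Z^k \cap \operatorname{int} \operatorname{dom} f \subset S$. 

Proof. Set 

$$M_\nu = \{i : i \in S,\ \nu - 1 \le |i| < \nu\}, \quad \nu = 1, 2, \ldots.$$

Trivially, the number of elements in $M_\nu$ does not exceed $(2\nu + 1)^k$. Using this and 
the equivalence of (1) to condition (iii) in Theorem 6.1 one finds 

$$\sum_{i \in S} e^{-f(i)} = \sum_{\nu=1}^{\infty} \sum_{i \in M_\nu} e^{-f(i)} \le \sum_{\nu=1}^{\infty} (2\nu + 1)^k e^{-\lambda(\nu - 1) - a} < \infty.$$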


Suppose the conditions for the converse assertion are satisfied and assume that 

$$\int e^{-f} \, d\lambda = \infty.$$

Then (cf. the end of the proof of Theorem 6.1) there exists a $y \ne 0$ such that 
$f(x + \lambda y)$ is non-increasing as a function of $\lambda$ for every $x \in R^k$. Let $i_0$ be a point in 
$Z^k \cap \operatorname{int} \operatorname{dom} f$, let $U$ be an open set such that $i_0 \in U$ and 

$$\alpha = \sup\{f(x) : x \in U\} < \infty,$$

and set 

$$M = Z^k \cap \{x + \lambda y : x \in U,\ \lambda \ge 0\}.$$

Then $M \subset S$ and $f(i) \le \alpha$ for $i \in M$. Moreover, as will be shown below, $M$ contains 
infinitely many points. Consequently 

$$\sum_{i \in S} e^{-f(i)} \ge \sum_{i \in M} e^{-f(i)} = \infty.$$

In proving that $M$ is an infinite set, it causes no loss of generality to assume that 
$i_0 = 0$. Furthermore, it suffices to show that the distance between the two sets $Z^k$ 
and $\{ny : n = 1, 2, \ldots\}$ is zero. If the coordinates $y_1, \ldots, y_k$ of $y$ are such that they do 
not satisfy any equality 

$$m_1 y_1 + \cdots + m_k y_k = m_0 \tag{2}$$

where $m_0, m_1, \ldots, m_k$ are integers, not all 0, then this follows from the well-known 
theorem of Kronecker (see e.g. Hardy and Wright (1960), p. 382) which states that 
the sequence $ny$ modulo 1 ($n = 1, 2, \ldots$) is everywhere dense. The general case, of 
possible dependence between $y_1, \ldots, y_k$ of the form (2), is now simply dealt with. ► 

A probability measure n on is said to be log-concave if ^ 

7r((l - A)Co + XC,)>n{Cg)^-UC^f 

for all convex subsets Cq and Ci of R^, and all Ag(0, 1). 

It follows simply from this definition that if n is log-concave and if a is an affine 
mapping on R^ and into R^ then na is also log-concave. 

Theorem 6.3. Let $\pi$ be a probability measure on $R^k$ of continuous type.

Then $\pi$ is log-concave if and only if $d\pi/d\lambda$ is log-concave.

This result is due to Prékopa (1971), who proved the if assertion, and to Borell
(1975) (see also Prékopa 1973).

6.2 UNIMODALITY OF CONTINUOUS-TYPE DISTRIBUTIONS

Let $\pi$ be the probability measure of a continuous type distribution on $R^k$ and let $S$
and $p$ denote, respectively, the support and the density of $\pi$. Set

$$\varphi = -\ln p .$$

The distribution is said to be unimodal if $\varphi$ is quasiconvex and strongly
unimodal if $\varphi$ is convex. Clearly, these specifications are equivalent to $p$ being
quasiconcave, respectively log-concave. It may also be noted that in the
one-dimensional case, $\pi$ is unimodal if and only if there exists a real number $m$ such
that $p$ is non-decreasing on $(-\infty, m]$ and non-increasing on $[m, \infty)$.

Unimodality is a rather weak property which allows only a very limited set of 
useful conclusions to be drawn. In particular, it does not hold that the marginal 
distributions of a unimodal distribution are unimodal (counterexamples are 
easily constructed), nor that the convolutions of unimodal distributions are again 
unimodal (for a one-dimensional counterexample, see Appendix II, written by 
K. L. Chung, in Gnedenko and Kolmogorov (1954)). 

The closure, cl $f$, of a convex function $f$ on $R^k$ agrees with $f$ except, possibly, at
relative boundary points of dom $f$. Hence, when $\pi$ is strongly unimodal, nothing is
lost by assuming $\varphi$ closed.

Suppose the distribution is strongly unimodal with $\varphi$ closed. Since

$$\int e^{-\varphi}\, d\lambda = 1 < \infty$$

one finds from Theorem 6.1 that $0 \in \operatorname{int}\operatorname{dom} \varphi^*$ and hence, by Theorem 5.20, the
non-empty sets among the level sets $\{x : p(x) \ge a\}$, $a \in R$, are all closed, bounded,
and convex, and the set

$$\{x : p(x) = \sup_z p(z)\}$$

of mode points of $p$ is non-empty.

On account of Theorem 6.3 and the remark preceding that result one has that if
$\pi$ is strongly unimodal and if $a$ is an affine mapping on $R^k$ and into $R^l$ such that
$1 \le l \le k$ and $a$ has rank $l$ then $\pi a$ is strongly unimodal. In particular:

Theorem 6.4. Marginal distributions of strongly unimodal (continuous type)
distributions are again strongly unimodal.

Corollary 6.1. Convolutions of strongly unimodal (continuous type) distributions are
again strongly unimodal.

Proof. Let $\pi_1$ and $\pi_2$ be strongly unimodal probability measures on $R^k$, with
densities $p_1$ and $p_2$. The product measure $\pi = \pi_1 \times \pi_2$ has density $p$ given by
$p(x) = p_1(x_1)\, p_2(x_2)$ where $x = (x_1, x_2) \in R^{2k}$. It is thus obvious that $\pi$ is strongly
unimodal, and since strong unimodality is preserved under regular affine
transformations, the result follows.

Corollary 6.1 was proved for $k = 1$ by Ibragimov (1956) and generally by
Davidovič, Korenbljum, and Hacet (1969). In fact, it follows from Ibragimov
(1956) that, in the one-dimensional case, one has the stronger result:

Theorem 6.5. A one-dimensional, continuous-type distribution is strongly unimodal
if and only if its convolution with any unimodal, continuous-type distribution is
again unimodal.

(Actually, Ibragimov introduced the concept of strong unimodality by the latter 
property.) Theorem 6.5 does not generalize to higher dimensions; in fact, an 
example, due to T. W. Anderson and given in Sherman (1955), shows that the only 
if assertion of Theorem 6.5 is not true for two-dimensional distributions. 

Example 6.1. Multivariate normal distribution. For the $N_r(\xi, \Sigma)$-distribution the
function $\varphi$ is, except for additive constants, given by

$$\tfrac{1}{2}(x - \xi)\,\Sigma^{-1}(x - \xi)' .$$

Thus $N_r(\xi, \Sigma)$ is strongly unimodal. ►

Example 6.2. Gamma distribution. Up to additive constants the $\varphi$ function of the
gamma-distribution with shape parameter $\lambda$ and scale parameter $\beta$ is

$$-(\lambda - 1)\ln x + x/\beta .$$

The distribution is unimodal for all $(\lambda, \beta) \in (0, \infty)^2$ and strongly unimodal if $\lambda \ge 1$. ►


Example 6.3. A one-dimensional distribution with density

$$a(\alpha, \beta)\, e^{-\alpha \varphi_0(x) + \beta x}$$

where $\alpha$ and $\beta$ are parameters, $a(\alpha, \beta)$ is a norming constant, and $\varphi_0$ is a convex
function, is obviously strongly unimodal provided $\alpha > 0$. Distributions of this
form are: normal ($\varphi_0(x) = x^2$), gamma ($\varphi_0(x) = -\ln x$), Laplace ($\varphi_0(x) = |x|$),
generalized inverse Gaussian with power parameter 1 ($\varphi_0(x) = 1/x$) and
hyperbolic ($\varphi_0(x) = \sqrt{1 + x^2}$) (see the next example). ►

Example 6.4. Multivariate hyperbolic distribution. The $r$-dimensional hyperbolic
distribution has probability function

(1)  $a(\alpha, \beta, \delta, \Delta)\, \exp\{-\alpha \sqrt{\delta^2 + (x - \mu)\Delta^{-1}(x - \mu)'} + \beta \cdot (x - \mu)\}$

where $\alpha > 0$, $\delta > 0$, $\beta \in R^r$, $\mu \in R^r$, and $\Delta$, a positive definite $r \times r$ matrix, are
parameters, and where the norming constant $a(\alpha, \beta, \delta, \Delta)$ is expressed by means of
the modified Bessel function of the third kind and with index $(r + 1)/2$.
(The hyperbolic distributions were introduced in Barndorff-Nielsen
(1976).) As a function of $x$, the logarithm of (1) determines a hyperboloid, and
hence the distribution is strongly unimodal. ►

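Example 6.5. Dirichlet distribution. The density of the Dirichlet distribution with
parameters $\lambda_1 > 0, \ldots, \lambda_{k+1} > 0$ is

$$\frac{\Gamma(\lambda_1 + \cdots + \lambda_{k+1})}{\Gamma(\lambda_1) \cdots \Gamma(\lambda_{k+1})}\; x_1^{\lambda_1 - 1} \cdots x_k^{\lambda_k - 1} (1 - x_1 - \cdots - x_k)^{\lambda_{k+1} - 1}$$

for $x_1 > 0, \ldots, x_k > 0$ and $x_1 + \cdots + x_k < 1$, and 0 otherwise. It is strongly unimodal
provided $\lambda_i \ge 1$, $i = 1, \ldots, k + 1$. ►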

Example 6.6. Wishart distribution. The $\binom{r+1}{2}$-dimensional Wishart-distribution
with $f$ degrees of freedom and mean value $f\Sigma$ is the distribution of a random
variable of the form $x = v_1'v_1 + \cdots + v_f'v_f$ where $v_1, \ldots, v_f$ are independent and
$N_r(0, \Sigma)$-distributed. It will be denoted by $W_r(f, \Sigma)$ and it is of continuous type
for $f \ge r$, in which case its density with respect to Lebesgue measure on $R^{\binom{r+1}{2}}$ is
given by

$$w_r(f)\, |\Sigma|^{-f/2}\, |x|^{(f - r - 1)/2} \exp\{-\tfrac{1}{2} \operatorname{tr}(\Sigma^{-1} x)\}$$

for $x$ (viewed as an $r \times r$ matrix) positive definite. Here

$$w_r(f)^{-1} = 2^{fr/2}\, \pi^{r(r-1)/4} \prod_{i=1}^{r} \Gamma\!\left(\frac{f - i + 1}{2}\right).$$

The distribution is strongly unimodal because $-\ln |x|$ (interpreted as $+\infty$ for $x$
not positive definite) is a convex function on $R^{\binom{r+1}{2}}$, as will be verified in Example
7.2. ►


6.3 UNIMODALITY OF DISCRETE-TYPE DISTRIBUTIONS 

Consider a discrete type probability measure $\pi$ on $R^k$ and let $p$ be the probability
function of $\pi$ (i.e. $p(x) = \pi\{x\}$ for all $x \in R^k$). Set

$$\varphi = -\ln p ,$$

denote the support of $\pi$ by $S$ and suppose $S \subset Z^k$.

$\pi$ is said to be unimodal, respectively strongly unimodal, provided $S$ equals the
intersection of $Z^k$ and some convex set, and there exists a quasiconvex,
respectively convex, function on $R^k$ which coincides with $\varphi$ on $S$. If $S$ is of the form
specified in these definitions (i.e. if $\pi$ is c-discrete) then, as is simple to see, $\pi$ is
strongly unimodal if and only if $\varphi$ and conv $\varphi$ coincide on $S$. For $k = 1$,
unimodality is equivalent to the sequence $p(i)$, $i \in Z$, being first non-decreasing and
then non-increasing, and strong unimodality is equivalent to $S$ being a set of
integers and

(1)  $p(i)^2 \ge p(i - 1)\, p(i + 1), \qquad i \in Z .$

The analogue of Theorem 6.4 does not hold for discrete-type distributions, not
even if only marginalizations, which accord with the assumed lattice structure of
the supports, are considered.

Example 6.7. The two-dimensional distribution with support $\{0, 1, 2\}^2$ and point
probabilities $p(i, j)$ as given in Table 1 is strongly unimodal.

Table 1. Table of $c\,p(i, j)$ where $c = 13 + 2\sqrt{2}$

    i \ j      0           1           2
      0        2       $\sqrt{2}$      1
      1        4           2           1
      2        2       $\sqrt{2}$      1

However, if $x = (x_1, x_2)$ is a random variable following this distribution then
the distribution of $x_2$ is not strongly unimodal. ►
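The failure of strong unimodality for the marginal can be confirmed by a direct computation. The sketch below assumes that Table 1 reads, row by row, as $(2, \sqrt{2}, 1)$, $(4, 2, 1)$, $(2, \sqrt{2}, 1)$, which is the reading consistent with $c = 13 + 2\sqrt{2}$. The function name is hypothetical.

```python
import math

r2 = math.sqrt(2.0)
# c * p(i, j) as read from Table 1 (assumed reading); rows are i, columns are j
table = [
    [2.0, r2, 1.0],
    [4.0, 2.0, 1.0],
    [2.0, r2, 1.0],
]
c = sum(sum(row) for row in table)  # total mass before normalisation
# marginal distribution of the second coordinate x2 (column sums)
marg_x2 = [sum(table[i][j] for i in range(3)) / c for j in range(3)]

def satisfies_condition_1(p, tol=1e-12):
    """p(i)^2 >= p(i-1) p(i+1) at every interior point of a finite sequence."""
    return all(p[i] ** 2 >= p[i - 1] * p[i + 1] - tol for i in range(1, len(p) - 1))

print(abs(c - (13 + 2 * r2)) < 1e-12, satisfies_condition_1(marg_x2))
```

Under this reading the column sums are proportional to $8$, $2 + 2\sqrt{2}$ and $3$, and $(2 + 2\sqrt{2})^2 < 24$, so condition (1) fails at the middle point.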

Also, Corollary 6.1 does not, in general, extend to the discrete case. 

Example 6.8. Let $\pi_1$ and $\pi_2$ be the probability measures on $R^2$ having support
$\{0, 1\}^2$ and point probabilities $\pi_1(0, 0) = \pi_1(1, 1) = 4/10$, $\pi_1(0, 1) = \pi_1(1, 0)
= 1/10$, respectively $\pi_2(0, 0) = \pi_2(1, 1) = 1/10$, $\pi_2(0, 1) = \pi_2(1, 0) = 4/10$. The
convolution $\pi = \pi_1 * \pi_2$ has support $\{0, 1, 2\}^2$, and $\pi(1, 1) = 16/100$ and $\pi(0, 1)
= \pi(2, 1) = 17/100$ so that $\pi(1, 1)^2 < \pi(0, 1)\, \pi(2, 1)$. Thus $\pi$ is not strongly
unimodal even though both $\pi_1$ and $\pi_2$ have this property. ►
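The point probabilities quoted in this example can be recomputed exactly. The following sketch uses rational arithmetic; the function name `convolve` is introduced only for this illustration.

```python
from fractions import Fraction as F
from itertools import product

pi1 = {(0, 0): F(4, 10), (1, 1): F(4, 10), (0, 1): F(1, 10), (1, 0): F(1, 10)}
pi2 = {(0, 0): F(1, 10), (1, 1): F(1, 10), (0, 1): F(4, 10), (1, 0): F(4, 10)}

def convolve(p, q):
    """Convolution of two probability functions on Z^2 given as dicts."""
    out = {}
    for (a, pa), (b, qb) in product(p.items(), q.items()):
        key = (a[0] + b[0], a[1] + b[1])
        out[key] = out.get(key, F(0)) + pa * qb
    return out

pi = convolve(pi1, pi2)
print(pi[(1, 1)], pi[(0, 1)], pi[(2, 1)])
print(pi[(1, 1)] ** 2 < pi[(0, 1)] * pi[(2, 1)])
```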

In three dimensions it may even happen that the $n$-fold ($n > 3$) convolution of a
strongly unimodal distribution with itself is not strongly unimodal (cf. the remark
after Example 10.16).

Some useful, partial extensions of Corollary 6.1 are however available. Firstly,
in one dimension the result carries over. (This result is contained already in a
paper by Fekete (1912).) What is more, in complete analogy with Theorem 6.5 one
has:

Theorem 6.6. A one-dimensional distribution, whose support is a subset of Z, is 
strongly unimodal if and only if its convolution with any unimodal distribution, 
whose support is a subset of Z, is again unimodal. 

This result was given by Keilson and Gerber (1971). ► 

One immediate consequence of Theorem 6.6, which is used at several places in 
the present treatise, is that convolutions of binomial distributions (with, possibly, 
differing probability parameters) are strongly unimodal. 
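As a numerical illustration of this consequence, the sketch below convolves two binomial distributions with different probability parameters and tests condition (1) on the result. The parameter values, the helper names and the small tolerance are assumptions made for the example.

```python
from math import comb

def binomial_pmf(n, p):
    """Point probabilities of the binomial distribution on 0, ..., n."""
    return [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]

def convolve(p, q):
    """Convolution of two probability functions supported on 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out

def is_strongly_unimodal(p, tol=1e-15):
    """Condition (1): p(i)^2 >= p(i-1) p(i+1) on a support of consecutive integers."""
    return all(p[i] ** 2 >= p[i - 1] * p[i + 1] - tol for i in range(1, len(p) - 1))

mix = convolve(binomial_pmf(5, 0.2), binomial_pmf(7, 0.9))  # illustrative parameters
print(is_strongly_unimodal(mix))
```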

A number of sufficient conditions for two- or three-dimensional, discrete 
distributions to be strongly unimodal have been derived by Pedersen (1975a, b). 
In particular, some of these conditions imply that convolutions of trinomial 
distributions (with, possibly, different probability parameters) are strongly 
unimodal. Furthermore, Pedersen used the conditions to prove the strong
unimodality of the distributions of various marginals of certain two- and
three-dimensional contingency tables, cf. also Examples 10.12 and 10.13.

Example 6.9. The Poisson distribution satisfies (1) and is hence strongly unimodal. ►

Example 6.10. The negative binomial distribution is unimodal for all values of the
shape parameter $\chi$ and, by (1), strongly unimodal if $\chi \ge 1$. ►
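Condition (1) for the negative binomial distribution can be examined numerically. The sketch below assumes the usual form of the point probabilities, $\Gamma(\chi + x)/\{x!\,\Gamma(\chi)\}\,(1 - p)^{\chi} p^{x}$, with an illustrative value of $p$. The function names are hypothetical.

```python
from math import exp, lgamma, log

def negative_binomial_pmf(chi, p, n_terms):
    """First n_terms point probabilities, shape chi > 0 and p in (0, 1)."""
    return [
        exp(lgamma(chi + x) - lgamma(x + 1) - lgamma(chi) + chi * log(1 - p) + x * log(p))
        for x in range(n_terms)
    ]

def log_concavity_gaps(p):
    """2 ln p(i) - ln p(i-1) - ln p(i+1); all gaps >= 0 is condition (1)."""
    return [2 * log(p[i]) - log(p[i - 1]) - log(p[i + 1]) for i in range(1, len(p) - 1)]

for chi in (0.5, 1.0, 3.0):
    gaps = log_concavity_gaps(negative_binomial_pmf(chi, 0.4, 40))
    print(chi, min(gaps) >= -1e-9)
```

For this form of the point probabilities the gap at $i$ equals $\ln\{1 + (\chi - 1)/(i^2 + \chi i)\}$, which is non-negative exactly when $\chi \ge 1$.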

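Example 6.11. The multinomial distribution has point probabilities

$$\binom{n}{i}\, \pi_1^{i_1} \cdots \pi_k^{i_k} (1 - \pi_1 - \cdots - \pi_k)^{n - i_1 - \cdots - i_k}, \qquad \binom{n}{i} = \frac{n!}{i_1! \cdots i_k!\,(n - i_1 - \cdots - i_k)!} .$$

Clearly, in order to show that this distribution is strongly unimodal it suffices to
establish the existence of a convex function on $R^k$ which coincides with
$-\ln \binom{n}{i}$ for $i \in Z^k$, $i_1 \ge 0, \ldots, i_k \ge 0$ and $i_1 + \cdots + i_k \le n$. Now, writing
$i_\cdot = i_1 + \cdots + i_k$,

$$\binom{n}{i}^{-1} = \frac{\Gamma(i_1 + 1) \cdots \Gamma(i_k + 1)\, \Gamma(n - i_\cdot + 1)}{\Gamma(n + 1)} = c(i)$$

where $c$ is the Laplace transform of the continuous type probability measure on
$R^k$ having density

(2)  $(n + k)^{(k)}\, e^{y_1 + \cdots + y_k}\, (1 + e^{y_1} + \cdots + e^{y_k})^{-(n + k + 1)}$, with $(n + k)^{(k)} = (n + k)(n + k - 1) \cdots (n + 1)$,

at $y = (y_1, \ldots, y_k) \in R^k$. Since Laplace transforms are log-convex (cf. Section 7.1),
the result follows. ►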
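The Laplace transform identity behind this argument can be checked numerically for $k = 1$. The sketch below assumes that density (2) reads $(n + 1)\, e^{y} (1 + e^{y})^{-(n + 2)}$ in that case, and it approximates the transform by a trapezoidal rule on a truncated range. The integration limits, step count and function names are choices made for the illustration.

```python
from math import comb, exp, log1p

def density(y, n):
    """Assumed form of density (2) for k = 1: (n + 1) e^y / (1 + e^y)^(n + 2)."""
    return (n + 1) * exp(y - (n + 2) * log1p(exp(y)))

def laplace_transform(theta, n, lo=-40.0, hi=40.0, steps=80000):
    """Trapezoidal approximation of the integral of e^(theta*y) * density(y)."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps + 1):
        y = lo + j * h
        w = 0.5 if j in (0, steps) else 1.0
        total += w * exp(theta * y) * density(y, n)
    return total * h

n = 5
for i in range(n + 1):
    # the two printed numbers should agree: c(i) and the inverse binomial coefficient
    print(i, laplace_transform(i, n), 1 / comb(n, i))
```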

Example 6.12. Negative multinomial distribution. By an argument similar to that
given in Example 6.11 it may be shown that the $k$-dimensional negative
multinomial distribution with shape parameter $\chi > k$ is strongly unimodal, the
density corresponding to (2) being given by

$$\frac{\Gamma(\chi)}{\Gamma(\chi - k)}\, e^{y_1 + \cdots + y_k}\, (1 - e^{y_1} - \cdots - e^{y_k})^{\chi - k - 1}$$

for $e^{y_1} + \cdots + e^{y_k} < 1$, and 0 otherwise. ►

Example 6.13. The multivariate hypergeometric distribution is strongly unimodal.
A simple proof of this will be given in Example 9.18. ►

6.4 COMPLEMENTS 

(i) Hölder's inequality and log-convexity. One of the possible ways of expressing
Hölder's inequality is to say that if $f$ and $g$ are (Borel) measurable functions on $R^k$
into $[-\infty, \infty)$, $\mu$ is a $\sigma$-finite measure on $R^k$, and $\lambda \in [0, 1]$ then

(1)  $\displaystyle \int e^{\lambda f + (1 - \lambda) g}\, d\mu \le \left( \int e^{f}\, d\mu \right)^{\lambda} \left( \int e^{g}\, d\mu \right)^{1 - \lambda}$

(as follows from the elementary inequality

$$a^{\lambda} b^{1 - \lambda} \le \lambda a + (1 - \lambda) b, \qquad a \ge 0,\ b \ge 0,$$

by setting $a = \exp f / \int \exp f\, d\mu$ and $b = \exp g / \int \exp g\, d\mu$). In other words, the
mapping $f \to \int \exp f\, d\mu$ is log-convex.

Provided the integrals are finite and $0 < \lambda < 1$, equality holds in (1) if and only
if $f$ and $g$ are equal up to an additive constant.

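It follows from Hölder's inequality that if $f_1, \ldots, f_n$ are log-convex with
dom $\ln f_1 \cap \cdots \cap$ dom $\ln f_n \neq \emptyset$ and if $\lambda_1 \ge 0, \ldots, \lambda_n \ge 0$ then $\lambda_1 f_1 + \cdots + \lambda_n f_n$ is
log-convex. To see this, it suffices to consider the case $n = 2$, $\lambda_1 = \lambda_2 = 1$. For
$f = f_1 + f_2$, $0 < \lambda < 1$ and $x$ and $y$ in dom $\ln f_1 \cap$ dom $\ln f_2$ one has by the
log-convexity of $f_1$ and $f_2$

$$f(\lambda x + (1 - \lambda) y) \le e^{\lambda \ln f_1(x) + (1 - \lambda) \ln f_1(y)} + e^{\lambda \ln f_2(x) + (1 - \lambda) \ln f_2(y)} .$$

Viewing the sum of the two exponential terms as an integral and applying (1) one
obtains the result.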

(ii) Let $\pi$ be a continuous type probability measure on $R$ and let $n$ be an integer
with $n > 1$.

The product measure $\pi^n = \pi \times \cdots \times \pi$ is unimodal if and only if $\pi$ is strongly
unimodal; and then $\pi^n$ is strongly unimodal.

The only nontrivial part of this proposition is the only if assertion, which was
established, via the concept of Schur concavity, by Marshall and Olkin (1974).

(iii) Let $F$ be the distribution function of a one-dimensional distribution and
suppose (for simplicity) that $F$ is differentiable. Khintchine showed that the
distribution is unimodal with mode at 0 if and only if

$$G(x) = F(x) - x F'(x)$$

is a distribution function (see Gnedenko and Kolmogorov 1954, p. 157).

It may be noted that $-G$ is, in essence, a Legendre transform of $F$.

(iv) Any mixture of a Poisson distribution with a continuous type, unimodal
mixing distribution is unimodal (Holgate 1970). The crux in Holgate's proof
consists in showing that if the density $f$ of the mixing distribution is differentiable
with $\lambda_0 > 0$ as a mode point then the point probabilities

$$p_n = \int_0^{\infty} \frac{\lambda^n}{n!}\, e^{-\lambda} f(\lambda)\, d\lambda$$

of the mixture satisfy

$$(n + 1)\, \Delta p_n \le \lambda_0\, \Delta p_{n-1}, \qquad n = 1, 2, \ldots,$$

where $\Delta p_n = p_{n+1} - p_n$.

According to the result of Khintchine mentioned under (iii)

$$G(\lambda) = F(\lambda) - (\lambda - \lambda_0) f(\lambda)$$

is a distribution function, and a simple calculation shows that

$$\lambda_0\, \Delta p_{n-1} - (n + 1)\, \Delta p_n = \int_0^{\infty} \frac{\lambda^n}{n!}\, e^{-\lambda}\, dG(\lambda),$$

which is obviously non-negative.
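Holgate's inequality can be verified exactly in a simple case. The sketch below takes as mixing distribution the gamma distribution with density $\tfrac{1}{2}\lambda^2 e^{-\lambda}$ (an illustrative choice, not from the text), whose mode is $\lambda_0 = 2$ and for which the mixture probabilities are $p_n = (n + 1)(n + 2)/2^{n + 4}$.

```python
from fractions import Fraction as F

def mixture_probs(n_max):
    """Poisson mixture with mixing density (1/2) lambda^2 exp(-lambda).

    Integrating lambda^n e^(-lambda) / n! against this density gives
    p_n = (n + 1)(n + 2) / 2^(n + 4).
    """
    return [F((n + 1) * (n + 2), 2 ** (n + 4)) for n in range(n_max + 1)]

lam0 = 2  # mode of the mixing density
p = mixture_probs(60)
delta = [p[n + 1] - p[n] for n in range(len(p) - 1)]  # delta[n] = p_{n+1} - p_n
ok = all((n + 1) * delta[n] <= lam0 * delta[n - 1] for n in range(1, len(delta)))
print(ok)
```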



CHAPTER 7 


Laplace Transforms 

7.1 THE LAPLACE TRANSFORM 

Let $\mu$ be a positive, $\sigma$-finite measure on $R^k$ such that the function $c$ on $R^k$ and into
$(0, \infty]$ defined by

$$c(\theta) = \int e^{\theta \cdot x}\, d\mu, \qquad \theta \in R^k,$$

is not identically $+\infty$. Then $c$ is called the Laplace transform of $\mu$, and the effective
domain of $c$ is the set

$$\Theta = \{\theta : c(\theta) < \infty\},$$

which will be denoted by dom $c$.

Note that if $\theta_0 \in \Theta$ then the function

(1)  $c_{\theta_0}(\theta) = c(\theta + \theta_0)/c(\theta_0), \qquad \theta \in R^k,$

is also a Laplace transform, namely of the probability measure $\pi$ on $R^k$ given by

(2)  $\dfrac{d\pi}{d\mu}(x) = c(\theta_0)^{-1} e^{\theta_0 \cdot x} .$

Let $S$ and $C$ denote the support and the convex support of $\mu$ (i.e. $C$ is the closed
convex hull of $S$). Clearly, any probability measure $\pi$ of the form (2) has the same
support as $\mu$.

Set

$$\kappa = \ln c .$$

Theorem 7.1. The logarithm $\kappa$ of the Laplace transform $c$ is a closed convex function
on $R^k$ and $\kappa$ is strictly convex on dom $\kappa$ provided $\mu$ is not concentrated on an affine
subspace of $R^k$.

Proof. In view of (1) it causes no loss of generality to suppose that $\mu$ is a probability
measure. The statements on convexity then follow from Hölder's inequality,
including the criterion for equality, see Section 6.4(i). In proving that $\kappa$ is closed it

For any $\theta = (\theta_1, \theta_2) \in R^2$ one has
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e^'^f{x)dx 2 =il + xf) ^ 




whence 

$$\Theta = \{\theta : |\theta_2| < 1 \ \text{or}\ \theta_1 = 0,\ \theta_2 = \pm 1\},$$


and, for | 02 | < 1 

( 4 ) dd) = r (1 + xf)-‘exp{-(l - el)ix^ - ej\2{\ - ODf} 

J — 00 

Letting 6 tend to (0, 1) along the curve (s,^(l — e^)), 0 < £ < 1, one finds that 
c{6) tends to oo because the factor in front of the integration sign behaves as 
exp { 1 /4s} for £ i 0 while the integral is of the order of magnitude s^^^. Thus c is not 
continuous at (0, 1). ► 

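Henceforth in this section only transforms of probability measures $\pi$ will be
considered. For such measures $\kappa$ is called the cumulant transform. Let
$x = (x_1, \ldots, x_k)$ be a random variable with distribution $\pi$. Then

$$c(\theta) = E_{\pi} e^{\theta \cdot x} .$$

It is simple to see that for any vector $e \in R^k$ of unit length one has

(5)  $\displaystyle \lim_{\lambda \to \infty} e^{-\lambda d}\, c(\lambda e) = \begin{cases} 0 & \text{for } d > \delta^*(e \mid C) \\ \pi\{e \cdot x = \delta^*(e \mid C)\} & \text{for } d = \delta^*(e \mid C) \\ \infty & \text{for } d < \delta^*(e \mid C). \end{cases}$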


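The Fourier-Laplace transform of $\pi$ is the complex-valued function $c$ defined
on the set

$$\tilde{\Theta} = \{\zeta = \theta + i\eta : \theta \in \Theta,\ \eta \in R^k\}$$

by

$$c(\zeta) = \int e^{\zeta \cdot x}\, d\pi .$$

Theorem 7.2. The Fourier-Laplace transform $c$ is an analytic function on int $\tilde{\Theta}$, and
derivatives of $c$ at points in int $\tilde{\Theta}$ can be computed by differentiation under the
integration sign.

Proof. Let $e_j$ be the $j$th unit vector, $e_j = (0, \ldots, 0, 1, 0, \ldots, 0)$, let $\zeta \in$ int $\tilde{\Theta}$ and let $h$ be a
complex number $\neq 0$. Then

(6)  $\displaystyle \frac{c(\zeta + h e_j) - c(\zeta)}{h} = \int e^{\zeta \cdot x}\, \frac{e^{h x_j} - 1}{h}\, d\pi \to \int x_j e^{\zeta \cdot x}\, d\pi \quad \text{as } h \to 0,$

where the dominated convergence theorem has been used in conjunction with the
inequality

(7)  $\displaystyle \left| \frac{e^{ha} - 1}{h} \right| \le \frac{e^{\delta a} + e^{-\delta a}}{\delta} \qquad \text{for } |h| \le \delta$

which holds for any $a \in R$ and $\delta > 0$, since

$$\left| \frac{e^{ha} - 1}{h} \right| \le \sum_{\nu = 1}^{\infty} \frac{|h|^{\nu - 1} |a|^{\nu}}{\nu!} \le \frac{e^{\delta |a|}}{\delta} .$$

Thus, on int $\tilde{\Theta}$, $c$ is analytic separately in each of the coordinates $\zeta_1, \ldots, \zeta_k$ of $\zeta$ and
this, by a theorem of Hartogs (see e.g. Bochner and Martin 1948, p. 140) implies
that $c$ is analytic as a function of $\zeta$. The above also shows that first order
differentiations may be performed under the integration sign, and the result for
general order is provable by induction. ►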

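Corollary 7.1. The Laplace transform $c$ is infinitely often differentiable at every
point $\theta_0 \in$ int $\Theta$ and the derivatives may be computed by differentiation under the
integration sign. Furthermore, if $0 \in$ int $\Theta$ then $c$ can be expanded in a power series
around zero,

$$c(\theta) = \sum_{|i| \ge 0} \mu_i\, \frac{\theta^i}{i!}, \qquad |\theta| < \delta,$$

for some $\delta > 0$, and here $\mu_i$, $i = (i_1, \ldots, i_k)$, is the $i$th order moment of $x$,
$\mu_i = E\, x_1^{i_1} \cdots x_k^{i_k}$.

Corollary 7.2. If $0 \in$ int $\Theta$ then $\kappa$ can be expanded in a power series around zero,

$$\kappa(\theta) = \sum_{|i| \ge 1} \kappa_i\, \frac{\theta^i}{i!}, \qquad |\theta| < \delta,$$

for some $\delta > 0$, and here $\kappa_i$, $i = (i_1, \ldots, i_k)$, is the $i$th order cumulant of $x$.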
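The relation between the derivatives of the cumulant transform at zero and the cumulants can be illustrated in one dimension. The sketch below uses a small discrete distribution whose values are invented for the example, and it approximates the first two derivatives of $\kappa$ at 0 by central differences.

```python
import math

# Illustrative distribution on {0, 1, 2, 3}; these values are assumptions for the sketch.
support = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.3, 0.2]

def kappa(theta):
    """Cumulant transform ln E exp(theta * x)."""
    return math.log(sum(p * math.exp(theta * x) for x, p in zip(support, probs)))

h = 1e-3
k1 = (kappa(h) - kappa(-h)) / (2 * h)                # first derivative at 0
k2 = (kappa(h) - 2 * kappa(0.0) + kappa(-h)) / h**2  # second derivative at 0

mean = sum(p * x for x, p in zip(support, probs))
var = sum(p * (x - mean) ** 2 for x, p in zip(support, probs))
print(k1, mean, k2, var)
```

The first two cumulants are the mean and the variance, so the printed pairs should agree up to the finite-difference error.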


These two corollaries may also be proved without invoking Theorem 7.2, and
hence (implicitly) Hartogs' theorem, essentially by using the properties
expressed in (6) and (7). Furthermore, it was shown, in the proof of Theorem 7.2, that
$c$ is analytic separately in each coordinate of $\zeta$, a fact which together with the
uniqueness theorem for characteristic functions (or Fourier transforms; see e.g.
Kawata 1972, p. 326-327) implies:

Theorem 7.3. (Uniqueness). Let $c_1$ and $c_2$ be the Laplace transforms of two
probability measures $\pi_1$ and $\pi_2$ on $R^k$. If there exists an open set $M \subset R^k$ such that
$c_1(\theta) = c_2(\theta) < \infty$ for $\theta \in M$ then $\pi_1 = \pi_2$.

Proof. It causes no loss of generality to assume $0 \in M$. Let $c_i$ also denote the
Fourier-Laplace transform of $\pi_i$, $i = 1, 2$. Analytic continuation in one
coordinate of $\zeta$ at a time shows that $c_1(i\eta) = c_2(i\eta)$ for $\eta \in R^k$, i.e. $\pi_1$ and $\pi_2$ have the
same characteristic function. ►

7.2 COMPLEMENTS 

(i) As in Sections 6.2 and 6.3, let $\varphi = -\ln p$ where $p$ denotes the probability
function of a distribution on $R^k$ of either continuous or discrete type. For a
number of the standard statistical distributions that are strongly unimodal the
function $\varphi$ does coincide, on the support $S$ of the distribution, with a function
which is not only convex but, in fact, equal to the logarithm of a Laplace
transform (cf. Theorem 7.1). That the Wishart distribution has this property was
shown by Examples 6.6 and 7.2. Other instances are: the normal distribution on
$R^k$, the gamma distribution with shape parameter $\lambda \ge 1$, the Poisson distribution,
the multinomial distribution, and the negative multinomial distribution with
shape parameter $\chi > 1$ (as concerns the latter two distributions see Examples
6.11 and 6.12).

(ii) The following result was given in Barndorff-Nielsen (1970), but the present
proof, which uses convex duality, is simpler.

Theorem 7.4. To any open convex subset $\Theta$ of $R^k$ with $0 \in \Theta$ there exists a
probability measure $\pi$ on $R^k$ such that $\Theta$ is the domain of the Laplace transform of $\pi$
and such that the affine support of $\pi$ is equal to $R^k$.

Proof. Let $\varphi^*$ denote a closed convex function on $R^k$ with the properties that
dom $\varphi^* = \Theta$ and

$$\liminf_{\lambda \to \infty} \varphi^*(\lambda y) = \infty$$

for every $y \in R^k$ with $y \neq 0$. The existence of such a function was shown in
Example 5.2. Set $\varphi = (\varphi^*)^*$. By Theorem 6.1, $0 \in$ int dom $\varphi$ and thus int dom $\varphi \neq \emptyset$.
Hence the same theorem can be applied to $\varphi$ and, since $0 \in \Theta =$ int dom $\varphi^*$,
one obtains

$$\int e^{-\varphi}\, d\lambda < \infty .$$

For any constant $d$ one has $(\varphi^* + d)^* = \varphi - d$ and thus, for suitable choice of $\varphi^*$,

$$\int e^{-\varphi}\, d\lambda = 1 .$$

Now, let $\pi$ be the absolutely continuous probability measure on $R^k$ whose density
is $\exp(-\varphi)$. Clearly, the affine support of $\pi$ is $R^k$, and the formula

$$\int e^{\theta \cdot x}\, d\pi = \int e^{-(\varphi(x) - \theta \cdot x)}\, d\lambda$$

together with a straightforward, further application of Theorem 6.1, shows that
the Laplace transform of $\pi$ has $\Theta$ as domain. ►



PART 

III 


Exponential Families†


The theory of exponential families is developed from the beginning. After the 
introductory theory has been established, a systematic study is made of duality 
relations and the principal lods functions in exponential families. Finally, the 
character of ancillary and sufficient statistics, under exponential models, is 
investigated. 


† Throughout Part III, $\mathfrak{P}$ stands for an exponential family.




CHAPTER 8 


Introductory Theory of Exponential
Families†


A major part of the more elementary properties of exponential families are 
discussed. However, questions concerning duality and lods functions (log- 
probability functions, log-likelihood functions, log-plausibility functions) for 
these families are deferred to the next chapter. 


8.1 FIRST PROPERTIES 


The family of probability measures $\mathfrak{P}$ is said to be an exponential family provided
there exists a $\sigma$-finite measure $\mu$ on $\mathfrak{X}$, a positive integer $k$, real-valued functions
$a, \alpha_1, \ldots, \alpha_k$ on $\mathfrak{P}$ and real-valued measurable functions $b, t_1, \ldots, t_k$ on $\mathfrak{X}$ such that
$\mathfrak{P}$ is dominated by $\mu$, $b \ge 0$, and for every $P \in \mathfrak{P}$

(1)  $\dfrac{dP}{d\mu}(x) = a(P)\, b(x)\, e^{\alpha(P) \cdot t(x)}$

where $\alpha = (\alpha_1, \ldots, \alpha_k)$, $t = (t_1, \ldots, t_k)$. In this case (1) is called an exponential
representation of the densities of $\mathfrak{P}$ with respect to $\mu$. The probability measures in
an exponential family $\mathfrak{P}$ are mutually absolutely continuous. Hence they all have
the same support. Let $P_0$ be an arbitrary element in $\mathfrak{P}$; then, by (1),

(2)  $\dfrac{dP}{dP_0} = \dfrac{a(P)}{a(P_0)}\, e^{(\alpha(P) - \alpha(P_0)) \cdot t}$

for all $P \in \mathfrak{P}$.

Formula (2) in conjunction with Theorem 4.2 shows that if $\alpha_1, \ldots, \alpha_k$ are
affinely independent then $t$ is minimal sufficient.

When $\mathfrak{P}$ is exponential then to any $\sigma$-finite measure $\mu$ dominating $\mathfrak{P}$ there
exists a representation of the form (1). For each dominating $\mu$ let $k(\mu)$ denote the
smallest integer such that the densities of the probability measures in $\mathfrak{P}$ with
respect to $\mu$ are representable as in (1). In fact, $k(\mu)$ is an integer independent of
$\mu$; this integer is called the order of $\mathfrak{P}$ and is denoted by ord $\mathfrak{P}$. Any representation
(1) with $k =$ ord $\mathfrak{P}$ is said to be minimal.

† This chapter is largely self-contained.

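Suppose $(\mathfrak{X}, \mathfrak{A}, \mathfrak{P})$ is a statistical field such that the elements of $\mathfrak{P}$ are mutually
absolutely continuous and let $P_0$ be an arbitrary element of $\mathfrak{P}$. It is simple to see
that $\mathfrak{P}$ is exponential of order $k$ if and only if dim $V = k + 1$ where $V$ denotes the
linear space of functions on $\mathfrak{X}$ generated by 1 ($= 1_{\mathfrak{X}}$) and $\ln dP/dP_0$, $P \in \mathfrak{P}$.
Note that if $\mathfrak{P}$ is an exponential family having representation (1) and if

(3)  $\dfrac{dP}{d\tilde{\mu}}(x) = \tilde{a}(P)\, \tilde{b}(x)\, e^{\tilde{\alpha}(P) \cdot \tilde{t}(x)},$

with $\tilde{\alpha}$ and $\tilde{t}$ of dimension $\tilde{k}$, is another representation of $\mathfrak{P}$ then

(4)  $\Delta\alpha \cdot \Delta t = \Delta\tilde{\alpha} \cdot \Delta\tilde{t}$

where

$$\Delta\alpha = \alpha(P) - \alpha(P_0), \qquad \Delta\tilde{\alpha} = \tilde{\alpha}(P) - \tilde{\alpha}(P_0),$$

$$\Delta t = t(x) - t(x_0), \qquad \Delta\tilde{t} = \tilde{t}(x) - \tilde{t}(x_0),$$

and $P_0, P \in \mathfrak{P}$, $x_0, x \in \mathfrak{X}$.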

Lemma 8.1. Let (1) and (3) be two representations of an exponential family $\mathfrak{P}$ and
suppose (1) is minimal.

Then $\tilde{k} \ge k$ and there exist two constant $\tilde{k} \times k$ matrices $A$ and $\tilde{A}$, both of rank $k$,
and two constant $1 \times k$ vectors $B$ and $\tilde{B}$ such that

(5)  $\tilde{t}A + B = t$

(6)  $\tilde{\alpha}\tilde{A} + \tilde{B} = \alpha$

and

(7)  $A'\tilde{A} = I_k .$

If $\tilde{t}_1, \ldots, \tilde{t}_{\tilde{k}}$ are affinely independent then

$$\tilde{\alpha} = \alpha A' + \tilde{D},$$

and if $\tilde{\alpha}_1, \ldots, \tilde{\alpha}_{\tilde{k}}$ are affinely independent then

$$\tilde{t} = t\tilde{A}' + D,$$

where $D$ and $\tilde{D}$ denote constant vectors.

Proof. Since (1) is assumed minimal, both $t_1, \ldots, t_k$ and $\alpha_1, \ldots, \alpha_k$ are affinely
independent. From this fact and formula (4) the lemma follows in a simple way.

It may also be noted that (7) implies that $\tilde{A}A'$ is idempotent. Thus $\tilde{A}A'$ and $A\tilde{A}'$
are projections, onto $R^k A'$ and $R^k \tilde{A}'$ respectively.

As another consequence of Lemma 8.1 one has:

Corollary 8.1. The representation (1) is minimal if and only if both of the following
conditions are satisfied:

(i) $\alpha_1, \ldots, \alpha_k$ are affinely independent.

(ii) $t_1, \ldots, t_k$ are affinely independent.

Consider the representation (1) and let $\Theta = \alpha(\mathfrak{P})$. The mapping $P \to \alpha(P)$,
$P \in \mathfrak{P}$, is one-to-one. Thus $\{P_\theta : \theta \in \Theta\}$, where $P_\theta = \alpha^{-1}(\theta)$, is a parametrization of
$\mathfrak{P}$. Such a parametrization is called canonical, and minimal canonical if (1) is
minimal.

The statistics $t$ occurring in the various possible representations (1) are called
canonical statistics; $t$ is said to be minimal canonical if it occurs in a minimal
representation.

Where the identity mapping $x$ on $\mathfrak{X}$ is a minimal canonical statistic, $\mathfrak{P}$ will be
designated as linear.

An indexed family of probability measures $\mathfrak{P} = \{P_\omega : \omega \in \Omega\}$ with probability
functions of the form

$$p(x; \omega) = a(\omega)\, b(x)\, e^{\alpha(\omega) \cdot t(x)}$$

where $a(\omega) > 0$, $b(x) \ge 0$, $\alpha(\omega) \in R^k$ and $t(x) \in R^k$, is clearly exponential. If $t_1, \ldots, t_k$
are affinely independent then

$$\operatorname{ord} \mathfrak{P} = \dim(\operatorname{aff} \alpha(\Omega)),$$

and, furthermore, the mapping $\omega \to P_\omega$ is one-to-one if and only if $\omega \to \alpha(\omega)$ is
one-to-one.

Example 8.1. von Mises-Fisher distributions. Let $\mathfrak{X} = S_d$, the unit sphere in $R^d$,
and let $\mathfrak{P} = \{P_{(\mu, \kappa)} : (\mu, \kappa) \in S_d \times [0, \infty)\}$ be the family of von Mises-Fisher
distributions on $S_d$ given by

(8)  $\dfrac{dP_{(\mu, \kappa)}}{dP_0}(x) = a(\kappa)\, e^{\kappa \mu \cdot x}$

where $P_0$ is the uniform distribution on $S_d$, while $\mu$ and $\kappa$ are parameters, which
vary independently, $\mu \in S_d$, $\kappa \in [0, \infty)$. These parameters are called the mean
direction and the precision, respectively. The norming constant $a(\kappa)$ depends on $\kappa$
only and it may be expressed as

$$a(\kappa) = \frac{(\kappa/2)^{d/2 - 1}}{\Gamma(d/2)\, I_{d/2 - 1}(\kappa)}$$

with $I_\nu(\cdot)$ denoting the modified Bessel function of the first kind and order $\nu$. Thus,
for $d = 2$, i.e. for the von Mises distribution,

$$a(\kappa) = I_0(\kappa)^{-1},$$

and for $d = 3$, i.e. for the Fisher distribution,

$$a(\kappa) = \frac{\kappa}{\sinh \kappa} .$$

(For a comprehensive account of the history and theory of the von Mises-Fisher
distributions see Mardia (1972, 1975).)

$\mathfrak{P}$ is exponential of order $d$ and $\theta = \kappa\mu$, which varies in $\Theta = R^d$, is a minimal
canonical parameter. The mapping $(\mu, \kappa) \to P_{(\mu, \kappa)}$ is not a parametrization since
$P_{(\mu, 0)} = P_0$ for every $\mu$. ►
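The two special cases of the norming constant can be checked numerically, since $a(\kappa)^{-1}$ is the mean of $e^{\kappa \mu \cdot x}$ under the uniform distribution $P_0$. The sketch below relies on two standard facts that are not stated in the text: on the circle $\mu \cdot x = \cos\phi$ with $\phi$ uniform, and on the sphere in $R^3$ the coordinate $\mu \cdot x$ is uniform on $[-1, 1]$. The value of $\kappa$ and the function names are illustrative.

```python
import math

def bessel_i0(kappa, terms=60):
    """Modified Bessel function I_0 via its power series."""
    return sum((kappa / 2) ** (2 * m) / math.factorial(m) ** 2 for m in range(terms))

def mean_exp_circle(kappa, steps=2000):
    """Average of exp(kappa * cos(phi)) for phi uniform on the circle (d = 2)."""
    return sum(math.exp(kappa * math.cos(2 * math.pi * j / steps)) for j in range(steps)) / steps

def mean_exp_sphere(kappa, steps=20000):
    """Average of exp(kappa * u), u uniform on [-1, 1] (d = 3), by the midpoint rule."""
    h = 2.0 / steps
    return sum(math.exp(kappa * (-1 + (j + 0.5) * h)) for j in range(steps)) * h / 2

kappa = 1.7  # illustrative precision
print(1 / mean_exp_circle(kappa), 1 / bessel_i0(kappa))
print(1 / mean_exp_sphere(kappa), kappa / math.sinh(kappa))
```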

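Henceforth, $\{P_\theta : \theta \in \Theta\}$ will denote a canonical parametrization of $\mathfrak{P}$ and, with

(9)  $\dfrac{dP_\theta}{d\mu}(x) = a(\theta)\, b(x)\, e^{\theta \cdot t(x)}$

being the exponential representation considered, $S$ will stand for the (common)
support of $P_\theta t$, $\theta \in \Theta$, and $C$ for the convex support, i.e. $C =$ cl conv $S$. Moreover, $c$
will denote the function defined on $R^k$ by

$$c(\theta) = \int e^{\theta \cdot t(x)}\, b(x)\, d\mu .$$

Thus $a(\theta) = c(\theta)^{-1}$ for $\theta \in \Theta$. Finally, set $\kappa = \ln c$.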

Theorem 8.1. For any $\theta \in \Theta$, the Laplace transform of $P_\theta t$ is

(10)  $c(\zeta + \theta)/c(\theta), \qquad \zeta \in R^k .$

If $\theta \in$ int $\Theta$ then the statistic $t$ has moments of all orders with respect to $P_\theta$ and

(11)  $D^i \kappa(\theta) = \kappa_i(\theta)$

where $D$ is the differential operator, $i = (i_1, \ldots, i_k)$ is a vector of non-negative
integers and $\kappa_i(\theta)$ denotes the $i$th order cumulant of $t$ under $P_\theta$.

Proof. Straightforward, using Theorem 7.2 and Corollary 7.2. ►

In particular, one has by (11)

(12)  $E_\theta t = \dfrac{\partial \kappa}{\partial \theta}$

(13)  $V_\theta t = \dfrac{\partial^2 \kappa}{\partial \theta\, \partial \theta'}$

where $V_\theta t$ is non-singular.

In the majority of cases to be considered the representation (9) will be chosen so
that $0 \in \Theta$ and $a(0) = 1$. Then $c$ and $\kappa$ are, respectively, the Laplace and cumulant
transform of $P_0 t$. Furthermore,

(14)  $\dfrac{dP_\theta}{dP_0}(x) = a(\theta)\, e^{\theta \cdot t(x)} .$

Any exponential representation of this latter type is called a standard
representation, and minimal standard provided (14) is minimal.
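Formulas (12) and (13) can be illustrated for a standard representation in one dimension. In the sketch below the base distribution $P_0 t$ is a small discrete distribution invented for the example, $P_\theta$ is obtained by exponential tilting, and the derivatives of $\kappa$ are approximated by central differences. The function names are hypothetical.

```python
import math

# Illustrative base distribution P_0 t on {0, 1, 2, 3}; values are assumptions for the sketch.
support = [0, 1, 2, 3]
base = [0.1, 0.4, 0.3, 0.2]

def c(theta):
    """Laplace transform of the base distribution."""
    return sum(p * math.exp(theta * x) for x, p in zip(support, base))

def kappa(theta):
    """Cumulant transform ln c."""
    return math.log(c(theta))

def tilted(theta):
    """Point probabilities of P_theta: base(x) * exp(theta * x) / c(theta)."""
    return [p * math.exp(theta * x) / c(theta) for x, p in zip(support, base)]

theta, h = 0.7, 1e-3
p = tilted(theta)
mean = sum(q * x for x, q in zip(support, p))
var = sum(q * (x - mean) ** 2 for x, q in zip(support, p))
d1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
d2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2
print(mean, d1, var, d2)
```

The mean and variance of the tilted distribution should match the first and second derivatives of $\kappa$ at $\theta$ up to the finite-difference error.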

Example 8.2. Let $\mathfrak{X} = R$ and let $\mathfrak{P}$ be of order 1 and linear. Thus $\mathfrak{P}$ has
exponential representation

(15)  $a(\theta)\, b(x)\, e^{\theta x} .$

Suppose that int $\Theta \neq \emptyset$.

If for each member of $\mathfrak{P}$ the mean equals the variance then $\mathfrak{P}$ is a subset of the
family of Poisson distributions (Kosambi 1949, Bildikar and Patil 1968; some
extensions of the result may be found in the latter paper). In proving this it causes
no loss of generality to assume $0 \in$ int $\Theta$ and $a(0) = 1$. Then, by assumption and
formulas (12) and (13),

$$\kappa'(\theta) = \kappa''(\theta), \qquad \theta \in \operatorname{int} \Theta,$$

and $\kappa(0) = 0$. Solution of this equation leads to the result.

For another exemplification, let $x_1, \ldots, x_n, \ldots$ be independent observations
following the distribution (15), set

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

and assume that for each $\theta \in$ int $\Theta$ the mean value of $x_1$ is positive. Again, suppose
$0 \in$ int $\Theta$ and $a(0) = 1$. Asymptotically, for $\theta \in$ int $\Theta$ and $n \to \infty$, $\bar{x}$ and the
dispersion index $s^2/\bar{x}$ have a joint normal distribution whose correlation is 0 if
and only if

(16)  $\kappa'(\theta)\, \kappa'''(\theta) - \kappa''(\theta)^2 = 0 .$

Thus, for the Poisson family the statistic $s^2/\bar{x}$, which is often used to test the
specification of Poisson distribution, and the statistic $\bar{x}$ are asymptotically
independent. Solution of (16) shows that if $\mathfrak{P}$ has this independence property
then it is Poissonian in the sense that there exists a positive constant $d$ such that

$$P_\theta\{x_1 = dr\} = \frac{\lambda^r}{r!}\, e^{-\lambda}, \qquad r = 0, 1, \ldots,$$

where $\lambda = \lambda_0 \exp\{\theta d\}$, $\lambda_0$ being a constant. ►
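Condition (16) can be evaluated directly from the first three cumulants, that is, from the mean, the variance and the third central moment. The sketch below does this for a Poisson distribution on the lattice $\{0, d, 2d, \ldots\}$ and, for contrast, for a binomial distribution. The parameter values, the truncation of the Poisson support and the function names are assumptions made for the illustration.

```python
from math import comb, exp, factorial

def first_three_cumulants(pmf):
    """Mean, variance and third central moment of a distribution given as {value: prob}."""
    m = sum(p * x for x, p in pmf.items())
    k2 = sum(p * (x - m) ** 2 for x, p in pmf.items())
    k3 = sum(p * (x - m) ** 3 for x, p in pmf.items())
    return m, k2, k3

def index_16(pmf):
    """Left-hand side of (16): kappa' * kappa''' - kappa''^2."""
    k1, k2, k3 = first_three_cumulants(pmf)
    return k1 * k3 - k2**2

lam, d = 2.5, 0.5  # illustrative Poisson mean and lattice spacing
poisson_scaled = {d * r: exp(-lam) * lam**r / factorial(r) for r in range(120)}
n, p = 10, 0.3
binomial = {i: comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)}

print(index_16(poisson_scaled), index_16(binomial))
```

For the binomial distribution the quantity equals $-n^2 p^3 (1 - p)$, so it is non-zero there, while it vanishes for the scaled Poisson distribution.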

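Let $\mathfrak{X}$ be an arbitrary Euclidean sample space, and let $P_0$ be a probability
measure and $t = (t_1, \ldots, t_k)$ a statistic on $\mathfrak{X}$. By the exponential family generated by
$P_0$ and $t$ is meant the family $\{P_\theta : \theta \in \Theta\}$ where $\Theta =$ dom $c$, $c$ being the Laplace
transform of $P_0 t$, and where

$$\frac{dP_\theta}{dP_0} = a(\theta)\, e^{\theta \cdot t}$$

with $a(\theta) = c(\theta)^{-1}$.

Consider an arbitrary exponential family $\mathfrak{P}$, suppose $t$ is a minimal canonical
statistic for $\mathfrak{P}$ and let $P_0$ be an element of $\mathfrak{P}$. Then the family $\bar{\mathfrak{P}}$ generated by $P_0$
and $t$ does not depend on $P_0$ and $t$, and furthermore $\mathfrak{P} \subset \bar{\mathfrak{P}}$. Hence it is reasonable
to speak of $\bar{\mathfrak{P}}$ as the family generated by $\mathfrak{P}$. Note that ord $\bar{\mathfrak{P}} =$ ord $\mathfrak{P}$. If $\mathfrak{P} = \bar{\mathfrak{P}}$
then $\mathfrak{P}$ is called full. When $\mathfrak{P}$ is given by an exponential representation

(17)  $\dfrac{dP_\theta}{d\mu}(x) = a(\theta)\, b(x)\, e^{\theta \cdot t(x)}, \qquad \theta \in \Theta,$

then $\mathfrak{P}$ is full if

$$\Theta = \left\{ \theta : \int e^{\theta \cdot t(x)}\, b(x)\, d\mu < \infty \right\} .$$

The converse assertion holds provided (17) is minimal. Thus, for a full exponential
family any of the possible minimal canonical parameter domains is convex (cf.
Theorem 7.1).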

$\mathfrak{P}$ is said to be regular if it is full and if for some (and hence for every) minimal
canonical parametrization $\{P_\theta : \theta \in \Theta\}$ the set $\Theta$ is open. The reason for
distinguishing this concept by the name regular will become apparent later on. Most
of the standard families of distributions are regular. Obviously, any full family $\mathfrak{P}$
having finite support is regular since $\Theta = R^k$.

Example 8.3. Multivariate normal family. The family $\mathfrak{N}_r$ of $r$-dimensional normal
distributions is the full exponential family on $R^r$ generated by the standard
normal distribution $N_r(0, I)$ with density function

$$(2\pi)^{-r/2}\, e^{-\frac{1}{2} x \cdot x}$$

and by the statistic

$$t(x) = (x, x'x) .$$

The derivative of $P_{(\xi, \Sigma)}$ with respect to $P_{(0, I)}$ may be written

(18)  $|\Sigma|^{-1/2}\, e^{-\frac{1}{2} \xi \Sigma^{-1} \xi'} \exp\{\xi \Sigma^{-1} \cdot x - \tfrac{1}{2} \operatorname{tr}((\Sigma^{-1} - I)\, x'x)\} .$

A canonical parameter is therefore given by

(19)  $\theta = (\xi \Sigma^{-1},\ -\tfrac{1}{2}(\Sigma^{-1} - I))$

and this relation together with the openness of the set of positive definite $r \times r$
matrices, considered as a subset of $R^{\binom{r+1}{2}}$, shows that $\Theta$ is open. Hence $\mathfrak{N}_r$ is
regular. ►

It is an immediate corollary of Theorem 7.4 that to any open convex subset $\Theta$
of $R^k$ there exists a regular exponential family $\mathfrak{P}$ of order $k$ such that $\Theta$ is the
parameter domain for a minimal canonical parametrization of $\mathfrak{P}$.

Suppose $\mathfrak{P}$ is full and that (9) is a minimal representation of $\mathfrak{P}$. If the convex
function $\kappa$ is steep then $\mathfrak{P}$ too is said to be steep. By Theorem 5.27, $\kappa$ is steep if and
only if

(20)  $(\tilde{\theta} - \theta) \cdot D\kappa(\lambda\theta + (1 - \lambda)\tilde{\theta}) \to \infty \quad \text{as } \lambda \downarrow 0$

for every $\theta \in$ int $\Theta$ and every $\tilde{\theta} \in$ bd $\Theta$. Clearly, $\kappa$ is steep either for all or for none of
the minimal representations of $\mathfrak{P}$, so that steepness is an intrinsic possible
property of $\mathfrak{P}$.

On account of Corollary 5.3 and Theorem 7.1 one has:

Theorem 8.2. If $\mathfrak{P}$ is regular then it is steep. ►

An instance of a steep but non-regular family $ is provided by: 

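Example 8.4. Inverse Gaussian family. Let $N^-(\mu, \lambda)$ indicate the continuous type
distribution on $R$ with density

(21) $\sqrt{\lambda/(2\pi)}\, x^{-3/2} e^{-\lambda(x - \mu)^2/(2\mu^2 x)}$, $x > 0$.

Here $\mu$ and $\lambda$ are parameters varying independently, both in $(0, \infty)$. The mean
value of (21) is $\mu$, and $\lambda$ is a measure of precision. The distribution $N^-(\mu, \lambda)$ was
named the inverse Gaussian by Tweedie. The class of inverse Gaussian distributions
will be denoted by $\mathfrak{N}^-$.

Rewriting (21) in the form

(22) $(2\pi)^{-1/2} x^{-3/2} \sqrt{\lambda}\, e^{\sqrt{\alpha\lambda}} e^{-\frac{1}{2}(\alpha x + \lambda x^{-1})}$

where $\alpha = \lambda/\mu^2$, one sees that $\mathfrak{N}^-$ is not full, the full family generated by $\mathfrak{N}^-$ being
obtained by allowing $\alpha$ to take on also the value 0. For $\alpha = 0$, (22) becomes

$(2\pi)^{-1/2} \sqrt{\lambda}\, x^{-3/2} e^{-\lambda/(2x)}$

which is the density of the stable distribution with characteristic exponent $\frac{1}{2}$ and
scale parameter $\lambda$ (see e.g. Feller 1966). Let $\bar{\mathfrak{N}}^-$ denote the full family.

It is apparent from (6) that the cumulant transform of the distribution with
$(\alpha, \lambda) = (0, 1)$ is given by

$\kappa(\alpha, \lambda) = -\frac{1}{2} \ln \lambda - \sqrt{\alpha\lambda}$, $(\alpha, \lambda) \in [0, \infty) \times (0, \infty)$.

Thus $\kappa$ is steep, but $\bar{\mathfrak{N}}^-$ is not regular. ►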
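
As a rough numerical companion to Example 8.4, the following Python sketch evaluates the inverse Gaussian density in the two forms (21) and (22) and looks at the derivative of the cumulant transform near the boundary value of the first parameter. The function names are illustrative only, and the sketch assumes the standard inverse Gaussian density with mean mu and precision lam, the relation alpha = lam / mu**2, and the cumulant transform kappa(alpha, lam) = -0.5 * ln(lam) - sqrt(alpha * lam).

```python
import math


def ig_density_21(x, mu, lam):
    """Inverse Gaussian density in the (mu, lam) form, cf. (21) (assumed standard form)."""
    return (math.sqrt(lam / (2 * math.pi)) * x ** -1.5
            * math.exp(-lam * (x - mu) ** 2 / (2 * mu ** 2 * x)))


def ig_density_22(x, alpha, lam):
    """Exponential-family form, cf. (22); alpha = 0 gives the stable law of index 1/2."""
    return ((2 * math.pi) ** -0.5 * x ** -1.5 * math.sqrt(lam)
            * math.exp(math.sqrt(alpha * lam))
            * math.exp(-0.5 * (alpha * x + lam / x)))


def kappa_dalpha(alpha, lam):
    """Partial derivative in alpha of kappa = -0.5*ln(lam) - sqrt(alpha*lam), alpha > 0."""
    return -0.5 * math.sqrt(lam / alpha)


mu, lam = 2.0, 3.0
alpha = lam / mu ** 2
for x in (0.1, 1.0, 5.0):
    # The two columns should agree up to rounding error.
    print(x, ig_density_21(x, mu, lam), ig_density_22(x, alpha, lam))

# Steepness in this example: the derivative grows without bound as alpha decreases to 0.
print([kappa_dalpha(a, lam) for a in (1e-2, 1e-4, 1e-6)])
```

The boundary points alpha = 0 belong to the parameter domain of the full family, which is why the family can be steep without being regular.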

Let $\Theta$ be a minimal canonical parameter domain for $\mathfrak{P}$. The family $\mathfrak{P}$ is called
open, convex, or connected if $\Theta$ has the respective property; $\mathfrak{P}$ has open kernel if
the interior of $\Theta$ is non-empty. These possible properties of $\mathfrak{P}$ are all intrinsic, i.e.
they do not depend on the particular minimal canonical parameter domain
considered.

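An exponential family $\mathfrak{P}$ is a power series family if it is linear and if the support
$S$ for the canonical statistic $x$ (the identity mapping on $\mathfrak{X}$) is a subset of $N_0^k$. In this
case the point probabilities of $\mathfrak{P}$ may be written

(23) $b(x)\lambda^x/g(\lambda)$

where

$\lambda = (e^{\theta_1}, \ldots, e^{\theta_k})$, $\lambda^x = \lambda_1^{x_1} \cdots \lambda_k^{x_k}$

and

$g(\lambda) = a(\theta)^{-1}$,

$g$ being the generating function for the coefficients $b(x)$, $x \in S$.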

Example 8.5. Sum-symmetric power series families. Let the representation (23) be
minimal and suppose $\mathfrak{P}$ has open kernel. Then the power series family is said to be
sum-symmetric provided $g$ depends on $\lambda$ through $\lambda_\cdot = \lambda_1 + \cdots + \lambda_k$ only. This
property is independent of which of the possible representations (23) is
considered.

The class of sum-symmetric power series families was introduced by Patil 
(1968) and its properties have been studied in that paper and by Joshi and Patil 
(1970, 1971). (See Section 10.3.) 

Examples of such families are the multinomial family, the multivariate Poisson
family, the negative multinomial family, and the multivariate logarithmic family,
the latter having point probabilities

$\frac{(x_\cdot - 1)!}{x_1! \cdots x_k!} \, \frac{\pi_1^{x_1} \cdots \pi_k^{x_k}}{-\ln(1 - \pi_\cdot)}$

where $x \in N_0^k \setminus \{0\}$ and $\pi \in \Pi$ with $\Pi$ given by equation (2) of Section 3.3. Each of
these four families is regular.

Lemma 8.2. Let $t$ be a minimal canonical statistic for $\mathfrak{P}$.

If $\mathfrak{P}$ has open kernel then $\mathfrak{P}t$ is complete.

Proof. Suppose $f$ is a $t$-measurable function on $\mathfrak{X}$ into $R$ and that

(24) $\int f(x) a(\theta) b(x) e^{\theta \cdot t} \, d\mu = 0$, $\theta \in \Theta$.

Define the measures $\mu^+$ and $\mu^-$ by

$d\mu^\pm = f^\pm(x) b(x) \, d\mu$,

where

$f^\pm(x) = \max\{\pm f(x), 0\}$,

and note that (24) implies

$\int e^{\theta \cdot t} \, d\mu^+ = \int e^{\theta \cdot t} \, d\mu^-$.

The lemma now follows from the uniqueness theorem for Laplace transforms
(Theorem 7.3). ►

Lemma 8.2 in conjunction with Theorem 4.5 or Corollary 4.4 yields, in a simple 
manner, a number of important results on stochastic independence. 

Example 8.6. An elementary illustration of Theorem 4.5 is provided by the $r \times c$
contingency table $x_{**}$ of independent Poisson variates and with no interaction, i.e.
the so-called multiplicative Poisson model. The well-known conditional
independence of the two marginals $x_{*\cdot}$ and $x_{\cdot *}$ given the total $x_{\cdot\cdot}$ is obtained from the
theorem by fixing, for instance, the row parameters. ►

Example 8.7. The independence of $\bar{x}$ and $s^2$ for a sample $x_1, \ldots, x_n$ from $N(\xi, \sigma^2)$ is
seen by fixing $\sigma^2$ and using Corollary 4.4. It also follows immediately from that
corollary that $\bar{x}$ and $s^2$ are independent of the B-ancillary statistic

$\left( \frac{x_1 - \bar{x}}{s}, \ldots, \frac{x_n - \bar{x}}{s} \right)$

and of the empirical measures of skewness and kurtosis

$g_1 = m_3/m_2^{3/2}$ and $g_2 = m_4/m_2^2 - 3$

where $m_r = \Sigma(x_i - \bar{x})^r/n$ $(r = 2, 3, 4)$. ►

Example 8.8. If $x_{1*}, \ldots, x_{n*}$ is a sample of $n$ from the $r$-dimensional normal
distribution $N_r(\xi, \Sigma)$ and if the variance $\Sigma$ is a diagonal matrix then the empirical
correlation matrix $r$ is independent of

$(\bar{x}_{\cdot 1}, \ldots, \bar{x}_{\cdot r}, \Sigma(x_{i1} - \bar{x}_{\cdot 1})^2/(n - 1), \ldots, \Sigma(x_{ir} - \bar{x}_{\cdot r})^2/(n - 1))$,

the set of estimates of the unspecified parameters. Again, this may be seen from
Corollary 4.4, on noting that the distribution of $r$ does not depend on $\xi$ and the
diagonal matrix $\Sigma$. ►

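See, moreover, Example 9.31.

Theorem 8.3. Suppose $\mathfrak{P}$ is full and $\{P_\theta : \theta \in \Theta\}$ is minimal. Endow $\Theta$ with the usual
Euclidean topology and $\mathfrak{P}$ with the weak topology. Then
(i) The mapping $\varphi : \theta \to P_\theta$ is continuous on int $\Theta$.
(ii) If there exists a minimal canonical statistic which is continuous then $\varphi^{-1}$ is
continuous on $\mathfrak{P}$.

Proof. Let

$\frac{dP_\theta}{d\mu}(x) = a(\theta) b(x) e^{\theta \cdot t(x)}$

be the minimal representation corresponding to $\{P_\theta : \theta \in \Theta\}$.

If $\theta_n \to \theta$ and $\theta_n, \theta \in$ int $\Theta$ then

$a(\theta_n) b(x) e^{\theta_n \cdot t(x)} \to a(\theta) b(x) e^{\theta \cdot t(x)}$, $x \in \mathfrak{X}$,

and this implies, by the so-called Scheffé Lemma, that $P_{\theta_n} \to P_\theta$ (weakly).

Now, suppose $t$ is continuous. To prove continuity of $\varphi^{-1}$ it must be shown
that if $P_{\theta_n} \to P_\theta$ then $\theta_n \to \theta$. Let $C_0(R^k)$ denote the space of continuous functions
on $R^k$ with compact support. From $P_{\theta_n} \to P_\theta$ and the continuity of $t$ it follows that
$P_{\theta_n}t \to P_\theta t$, i.e.

$\int f(t) \, dP_{\theta_n}t \to \int f(t) \, dP_\theta t$, $f \in C_0(R^k)$.

For shortness, let $\pi = P_\theta t$. Now,

$\int f(t) \, dP_{\theta_n}t = \int f(t) \frac{a(\theta_n)}{a(\theta)} e^{(\theta_n - \theta) \cdot t} \, d\pi$

and thus it suffices to prove that if $0 = \theta \in \Theta$ and if for some sequence $a_n$

(25) $a_n \int f(t) e^{\theta_n \cdot t} \, d\pi \to \int f(t) \, d\pi$, $f \in C_0(R^k)$,

then $\theta_n \to 0$.

On account of the compactness of the surface of the unit sphere in $R^k$, one can,
without loss of generality, assume

$\theta_n/|\theta_n| \to e$ where $|e| = 1$.

Let

$F(d) = \pi\{t \in R^k : t \cdot e \le d\}$, $d \in R$.

$F$ is a distribution function with at least two points of increase, $d_1$ and $d_2 > d_1$. In
fact, if $F$ had only one point of increase $d$, then

$1 = \pi\{t \cdot e = d\} = P_0\{t \cdot e = d\}$

in contradiction to the affine independence of the coordinates of $t$. Choose a
$\delta \in (0, (d_2 - d_1)/5)$ and let $J_i = \{t : d_i - \delta < t \cdot e < d_i + \delta\}$. Furthermore, let $f$ be
a non-negative function in $C_0(R^k)$ which satisfies

$f(t) = 0$, $t \notin J_1$,

and

$\int f(t) \, d\pi > 0$.

Such $f$ certainly exists. Let $S_f$ denote the (compact) support of $f$. For $n$ sufficiently
large one has

$\theta_n \cdot t = |\theta_n| \left\{ \left( \frac{\theta_n}{|\theta_n|} - e \right) \cdot t + e \cdot t \right\} < |\theta_n|(d_1 + 2\delta)$, $t \in S_f$,

and hence

$a_n \int_{R^k} f(t) e^{\theta_n \cdot t} \, d\pi < a_n e^{|\theta_n|(d_1 + 2\delta)} \int_{R^k} f(t) \, d\pi$.

By letting $n \to \infty$ and employing (25), one obtains

(26) $1 \le \liminf a_n e^{|\theta_n|(d_1 + 2\delta)}$.

In a similar way it may be shown that

(27) $1 \ge \limsup a_n e^{|\theta_n|(d_2 - 2\delta)}$.

(26), (27) and the inequality $d_1 + 2\delta < d_2 - 2\delta$ together imply $\theta_n \to 0$. ►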

In cases where $\mathfrak{P}$ is full and $\{P_\theta : \theta \in \Theta\}$ is minimal, $\tau$ will denote the mapping
defined on int $\Theta$ by

$\tau(\theta) = E_\theta t$

and $\mathfrak{T}$ will stand for $\tau(\mathrm{int}\,\Theta)$. The mapping $\tau$ is a one-to-one, both ways
continuously differentiable mapping between the two open, connected sets int $\Theta$
and $\mathfrak{T}$; moreover, $\tau$ is strictly increasing in the sense that

(28) $(\theta - \tilde\theta) \cdot (\tau(\theta) - \tau(\tilde\theta)) > 0$

for every pair of distinct points $\theta$, $\tilde\theta$ in int $\Theta$. It follows, in particular, that if $\Theta$ is open (i.e.
$\mathfrak{P}$ is regular) then $\mathfrak{P}$ may be parametrized by the mapping $\tau \to P_\theta$ (where $\tau = \tau(\theta)$).
Such a parametrization is called a mean value parametrization.
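
The mean value mapping can be made concrete in a one-dimensional case. The sketch below is an illustration only and is not taken from the text: it assumes the Bernoulli family in canonical form, with cumulant transform kappa(theta) = ln(1 + e^theta), so that tau(theta) = D kappa(theta) is the logistic function, the canonical domain is the whole real line, and the mean value domain is the open interval (0, 1). The function names are hypothetical.

```python
import math


def tau(theta):
    """Mean value mapping tau(theta) = D kappa(theta) for kappa = ln(1 + e^theta)."""
    return 1.0 / (1.0 + math.exp(-theta))


def tau_inverse(mean):
    """Inverse of tau: maps a mean value in (0, 1) back to the canonical parameter."""
    return math.log(mean / (1.0 - mean))


def increasing_gap(theta_a, theta_b):
    """Left-hand side of (28) for a pair of canonical parameter values."""
    return (theta_a - theta_b) * (tau(theta_a) - tau(theta_b))


# tau is one-to-one, so the family may equally well be indexed by the mean value.
for theta in (-2.0, 0.0, 1.5):
    print(theta, tau(theta), tau_inverse(tau(theta)))

# Inequality (28) for a pair of distinct points.
print(increasing_gap(-1.0, 2.0) > 0)
```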

Besides canonical and mean value parametrizations of regular exponential
families, parametrizations which are, so to speak, a mixture of the two are also of
interest. Consider a partition $(\theta^{(1)}, \theta^{(2)})$ of $\theta$ and the similar partition $(\tau^{(1)}, \tau^{(2)})$ of $\tau$.
Observing that $\tau = D\kappa$ and applying Theorem 5.34 to $\kappa$ one finds that the
mapping

(29) $\theta \to (\tau^{(1)}, \theta^{(2)})$

is a homeomorphism. Thus $(\tau^{(1)}, \theta^{(2)})$ furnishes a parametrization of $\mathfrak{P}$ which is
called a mixed parametrization. Theorem 5.34 moreover yields:

Theorem 8.4. If $\mathfrak{P}$ is regular then for any mixed parametrization of $\mathfrak{P}$ the partition
components $\tau^{(1)}$ and $\theta^{(2)}$ are variation independent. ►

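Example 8.9. For the family $\mathfrak{N}_r$ of $r$-dimensional normal distributions the pair
$(\xi, \Delta)$, where $\Delta = \Sigma^{-1}$, is a mixed parameter.

Another mixed parametrization of $\mathfrak{N}_r$ is determined as follows.

Suppose $x$ is normally distributed with mean $\xi$ and variance $\Sigma$, let $(x^{(1)}, x^{(2)})$
and $(\xi^{(1)}, \xi^{(2)})$ be similar partitions of $x$ and $\xi$, and let

$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ and $\Delta = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix}$

be the corresponding partitions of the variance and the precision. Then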


$\tau^{(1)} = (\xi^{(1)}, \Sigma_{11})$

$\theta^{(2)} = (\xi^{(1)}\Delta_{12} + \xi^{(2)}\Delta_{22}, \Delta_{12}, \Delta_{22})$

constitute a mixed parameter. The variables $\tau^{(1)}$ and $\theta^{(2)}$ are, respectively, the
parameters of the marginal (normal) distribution of $x^{(1)}$ and of the conditional
(normal) distribution of $x^{(2)}$ given $x^{(1)}$ (cf. formulas (1) and (2) of Section 1.1, and
Example 3.7). ►

Example 8.10. Genotype distributions. Let there be given a random sample of size $n$
from an infinite genetical, diploid population. Suppose that a certain locus carries
$m$ allelic genes $A_1, \ldots, A_m$ and let $p_{ij} > 0$ denote the proportion of individuals in
the population having genotype $A_iA_j$. The probability that the sample contains
$x_{ij}$ individuals of genotype $A_iA_j$ $(1 \le i \le j \le m)$ is

(30) $\frac{n!}{\prod_{i \le j} x_{ij}!} \prod_{i \le j} p_{ij}^{x_{ij}}$.

The population gene frequency of $A_i$ will be denoted by $p_i$. Note that

$p_i = p_{ii} + \frac{1}{2}(p_{1i} + \cdots + p_{i-1\,i} + p_{i\,i+1} + \cdots + p_{im})$

and

$p_1 + \cdots + p_m = 1$.

If the population has been created through random union of gametes and if no
selection has taken place then

(31) $p_{ii} = p_i^2$, $1 \le i \le m$
     $p_{ij} = 2p_ip_j$, $1 \le i < j \le m$.


Formula (31) defines the hypothesis of Hardy-Weinberg distribution of the
observed genotype numbers $x_{ij}$.

Let $\mathfrak{P}$ be the exponential family given by (30). $\mathfrak{P}$ is regular and a convenient
canonical (non-minimal) parametrization of $\mathfrak{P}$ is determined by

$\theta = (\theta_1, \ldots, \theta_m, \theta_{12}, \theta_{13}, \ldots, \theta_{m-1\,m})$

where

$\theta_i = \frac{1}{2} \ln p_{ii}$, $1 \le i \le m$
$\theta_{ij} = \frac{1}{2} \ln \frac{p_{ij}^2}{4p_{ii}p_{jj}}$, $1 \le i < j \le m$.

To this parametrization corresponds the canonical statistic

$t = (t_1, \ldots, t_m, t_{12}, t_{13}, \ldots, t_{m-1\,m})$

where

$t_i = 2x_{ii} + x_{1i} + \cdots + x_{i-1\,i} + x_{i\,i+1} + \cdots + x_{im}$, $1 \le i \le m$
$t_{ij} = x_{ij}$, $1 \le i < j \le m$.

Thus $t_i$ is the number of $A_i$-genes and $t_{ij}$ is the number of $A_iA_j$-heterozygotes in
the sample. There is exactly one linear constraint between $t_1, \ldots, t_m, t_{12}, \ldots, t_{m-1\,m}$
which is satisfied with probability 1, namely

$t_1 + \cdots + t_m = 2n$.

The hypothesis of Hardy-Weinberg distribution is equivalent to

$\theta_{ij} = 0$, $1 \le i < j \le m$,

and the parameters $\theta_{ij}$ express possible deviations from the hypothesis in a useful
way: $\theta_{ij}$ is positive if there is excess of $A_iA_j$-heterozygotes in the population and it
is negative in the opposite case. Moreover, if the population was initially in
Hardy-Weinberg distribution but has been subjected to selection, the fitness
component of genotype $A_iA_j$ being $w_{ij}$, then

$\theta_{ij} = \frac{1}{2} \ln \frac{w_{ij}^2}{w_{ii}w_{jj}}$.

The pair

$(p_*, \theta^{(2)})$

where

$p_* = (p_1, \ldots, p_m)$ and $\theta^{(2)} = (\theta_{12}, \theta_{13}, \ldots, \theta_{m-1\,m})$

furnishes an interesting example of a mixed parametrization, and the variation
independence of $p_*$ and $\theta^{(2)}$ (cf. Theorem 8.4) seems worth noting. ►

Mixed parametrizations are of particular interest in connection with the
notions of L-independence and cuts in exponential families, cf. Sections 9.2, 10.2,
and 10.3.
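
Returning to the genotype model of Example 8.10, the following Python sketch computes the heterozygote parameters for a population in Hardy-Weinberg proportions and for the same population after selection. It assumes the form theta_ij = 0.5 * ln(p_ij^2 / (4 p_ii p_jj)) for the heterozygote parameters and models selection by reweighting each genotype proportion by its fitness and renormalising. All function names and the numerical values are hypothetical, and alleles are indexed from 0 rather than 1.

```python
import math


def hardy_weinberg(p):
    """Genotype proportions p_ij (i <= j) under random union of gametes, cf. (31)."""
    m = len(p)
    return {(i, j): (p[i] ** 2 if i == j else 2 * p[i] * p[j])
            for i in range(m) for j in range(i, m)}


def apply_selection(p_geno, w):
    """Reweight genotype proportions by the fitness w[(i, j)] and renormalise."""
    raw = {g: w[g] * q for g, q in p_geno.items()}
    total = sum(raw.values())
    return {g: q / total for g, q in raw.items()}


def theta_het(p_geno, i, j):
    """Heterozygote parameter 0.5 * ln(p_ij^2 / (4 p_ii p_jj)) for i < j."""
    return 0.5 * math.log(p_geno[(i, j)] ** 2
                          / (4 * p_geno[(i, i)] * p_geno[(j, j)]))


def gene_frequencies(p_geno, m):
    """p_i = p_ii + half the total proportion of heterozygotes carrying allele i."""
    freqs = []
    for i in range(m):
        het = sum(q for (a, b), q in p_geno.items() if a != b and i in (a, b))
        freqs.append(p_geno[(i, i)] + 0.5 * het)
    return freqs


hw = hardy_weinberg([0.2, 0.3, 0.5])
fitness = {(0, 0): 1.0, (0, 1): 1.5, (0, 2): 1.0,
           (1, 1): 0.8, (1, 2): 1.2, (2, 2): 1.0}
selected = apply_selection(hw, fitness)

print([theta_het(hw, i, j) for i, j in ((0, 1), (0, 2), (1, 2))])
print(theta_het(selected, 0, 1), 0.5 * math.log(1.5 ** 2 / (1.0 * 0.8)))
print(gene_frequencies(selected, 3))
```

Under these assumptions the heterozygote parameters vanish in the Hardy-Weinberg case and depend only on the fitness components after selection, whatever the gene frequencies.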

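Throughout what follows, $(\theta^{(1)}, \theta^{(2)})$, $(t^{(1)}, t^{(2)})$, and $(\tau^{(1)}, \tau^{(2)})$ will be similar
partitions of $\theta$, $t$, and $\tau$, and the common dimension of $\theta^{(i)}$, $t^{(i)}$, and $\tau^{(i)}$ is denoted
by $k^{(i)}$, $i = 1, 2$.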

The next lemma, which will be particularly useful in Chapter 10, gives 
information on the relations between the various possible such partitions. 

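Lemma 8.3. Let $t = (t^{(1)}, t^{(2)})$ and $\tilde t = (\tilde t^{(1)}, \tilde t^{(2)})$ be minimal canonical statistics for $\mathfrak{P}$
and denote the dimensions of $t^{(i)}$ and $\tilde t^{(i)}$ by $k^{(i)}$ and $\tilde k^{(i)}$. Furthermore, let
$\theta = (\theta^{(1)}, \theta^{(2)})$ and $\tilde\theta = (\tilde\theta^{(1)}, \tilde\theta^{(2)})$ be the corresponding minimal canonical
parameters and let

$\tilde t = t\bar A + \bar B$, $\tilde\theta = \theta A + B$,

be the affine connections whose existence was established in Lemma 8.1. Here
$\bar A' = A^{-1}$. Partition $A$ as

$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$

such that $A_{11}$ is a $k^{(1)} \times \tilde k^{(1)}$ matrix and assume that
(a) for some value of $\theta^{(2)}$

$\dim \mathrm{aff}\, \Theta_{\theta^{(2)}} = k^{(1)}$

and

(b) $\tilde\theta^{(2)}$ depends on $\theta$ only through $\theta^{(2)}$.
Then

(i) $A_{12} = 0$
(ii) $t^{(1)} = \tilde t^{(1)}A_{11}' - \bar B^{(1)}A_{11}'$
(iii) $\tilde\theta^{(2)} = \theta^{(2)}A_{22} + B^{(2)}$
(iv) $k^{(1)} \le \tilde k^{(1)}$
(v) If $k^{(1)} = \tilde k^{(1)}$ then $\bar A_{11}A_{11}' = I_{k^{(1)}}$, $\bar A_{22}A_{22}' = I_{k^{(2)}}$, and $\bar A_{21} = 0$.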


Proof. The assertions (ii), (iii), (iv), and (v) are immediate consequences of (i).
To prove (i), observe that

(32) $\tilde\theta^{(2)} = \theta^{(1)}A_{12} + \theta^{(2)}A_{22} + B^{(2)}$.

According to the assumptions, there exists some value of $\theta^{(2)}$, $\theta_0^{(2)}$ say, such that
$\dim \mathrm{aff}\, \Theta_{\theta_0^{(2)}} = k^{(1)}$. Consequently one can find $\theta_i^{(1)}$ such that $\theta_i = (\theta_i^{(1)}, \theta_0^{(2)}) \in \Theta$,
$i = 0, 1, 2, \ldots, k^{(1)}$, and such that $\theta_i^{(1)} - \theta_0^{(1)}$, $i = 1, 2, \ldots, k^{(1)}$, are linearly
independent. Using (32) and assumption (b) one obtains

$0 = (\theta_i^{(1)} - \theta_0^{(1)})A_{12}$, $i = 1, 2, \ldots, k^{(1)}$,

and hence $A_{12} = 0$. ►

8.2 DERIVED FAMILIES 

From a given statistical field $(\mathfrak{X}, \mathfrak{A}, \mathfrak{P})$ (with $\mathfrak{P}$ exponential) other statistical fields
may be constructed, e.g. by margining or conditioning. Several important types of
such constructions lead again to exponential families and may preserve regularity
properties of the original family. This is discussed here, together with certain
related results, in a sequence of subsections.

Let

(1) $\frac{dP_\theta}{dP_0}(x) = a(\theta) e^{\theta \cdot t(x)}$, $\theta \in \Theta$

be a minimal exponential representation of $\mathfrak{P}$ with $0 \in \Theta$ and $a(0) = 1$.

(i) Affine hypotheses. Consider the field $(\mathfrak{X}, \mathfrak{A}, \mathfrak{P}_0)$ where $\mathfrak{P}_0$, which will
occasionally be called the hypothesis, is a proper subset of $\mathfrak{P}$. Any such $\mathfrak{P}_0$ is
obviously exponential. Let $\Theta_0$ denote the subset of $\Theta$ corresponding to $\mathfrak{P}_0$.
Clearly, ord $\mathfrak{P}_0$ = dim aff $\Theta_0$. The hypothesis $\mathfrak{P}_0$ is said to be affine if $\Theta_0$ is of the
form $\Theta \cap L$ where $L$ is an affine subspace of $R^k$.

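Theorem 8.5. Let $\mathfrak{P}_0$ be a subfamily of $\mathfrak{P}$. If $\mathfrak{P}_0$ is full then $\mathfrak{P}_0$ is affine. The converse
assertion holds provided $\mathfrak{P}$ is full.

Proof. Since the properties of being full or regular or affine are intrinsic, it causes
no loss of generality to assume that $0 \in \Theta_0$ and that aff $\Theta_0$ is of the form
$R^{k^{(1)}} \times \{0\}$. Then, writing $\theta = (\theta^{(1)}, \theta^{(2)})$ and $t = (t^{(1)}, t^{(2)})$ with $\theta^{(1)}$ and $t^{(1)}$ of
dimension $k^{(1)}$, one has that

$\frac{dP_{(\theta^{(1)},0)}}{dP_0} = a(\theta^{(1)}, 0) e^{\theta^{(1)} \cdot t^{(1)}}$  $((\theta^{(1)}, 0) \in \Theta_0)$

is a minimal standard representation of $\mathfrak{P}_0$. Let

$\tilde\Theta_0 = \{(\theta^{(1)}, 0) : \int e^{\theta^{(1)} \cdot t^{(1)}} \, dP_0 < \infty\}$

and note that

(2) $\Theta_0 \subset \Theta \cap \mathrm{aff}\,\Theta_0 \subset \tilde\Theta_0$.

Clearly, $\mathfrak{P}_0$ full means $\Theta_0 = \tilde\Theta_0$, and $\mathfrak{P}$ full implies $\tilde\Theta_0 = \Theta \cap \mathrm{aff}\,\Theta_0$. The theorem
now follows by (2). ►

Corollary 8.2. Suppose $\mathfrak{P}$ is regular. Then $\mathfrak{P}_0$ is regular if and only if it is affine.

Let $\mathfrak{P}$ be regular and let $\mathfrak{P}_0$ be the affine hypothesis obtained by fixing $\theta^{(2)}$ at
some value $\theta_0^{(2)}$. Moreover, let $\mathfrak{T}_0$ be the subset of $\mathfrak{T}$ corresponding to $\mathfrak{P}_0$. It will
be shown in Section 9.1 that regularity of $\mathfrak{P}$ implies $\mathfrak{T}$ = int $C$. Using this,
Theorem 8.4, and Corollary 8.2 one finds that the projections of $\mathfrak{T}$ (= int $C$) and
of $\mathfrak{T}_0$ on $R^{k^{(1)}} \times \{0\}$ are identical and equal to int $C_0$, where $C_0$ denotes the
convex support of the distribution of $t^{(1)}$.

It follows, furthermore, from Theorem 8.1 and equation (20) of Section 8.1 that
if $\mathfrak{P}$ is steep and $\mathfrak{P}_0$ is affine then $\mathfrak{P}_0$ is steep.

Generally, of course, for any concrete model $\mathfrak{P}$ some affine hypotheses are
more interesting than others.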

Example 8.11. Affine hypotheses of $\mathfrak{N}_r$. For the class $\mathfrak{N}_r$ of $r$-dimensional normal
distributions those hypotheses $\mathfrak{P}_0$ which are affine with $\xi = 0$ for some element of
$\mathfrak{P}_0$ and which have the property that $\xi$ and $\Delta$ are variation independent under
$\mathfrak{P}_0$ are of particular interest.

For any hypothesis $\mathfrak{P}_0$ of $\mathfrak{N}_r$ let $\Xi$ and $\Lambda$ denote the sets of values of $\xi$ and $\Delta$
under $\mathfrak{P}_0$, and let $\Xi_\Delta$ be the set of those $\xi$ for which $(\xi, \Delta)$ corresponds to an
element of $\mathfrak{P}_0$. Note that if $\mathfrak{P}_0$ is affine then the sets $\Xi_\Delta\Delta$, $\Delta \in \Lambda$, are affine and
parallel and $\Lambda$ is an affine subset of the set $\Gamma_r$ of positive definite $r \times r$ matrices.
Using this remark and equation (19) of Section 8.1 it is trivial to show:

Theorem 8.6. Suppose $\mathfrak{P}_0$ is of the form $\{N_r(\xi, \Sigma) : (\xi, \Delta) \in \Xi \times \Lambda\}$ with $0 \in \Xi$.

Then $\mathfrak{P}_0$ is affine if and only if $\Lambda$ is an affine subset of $\Gamma_r$ and $\Xi$ is a linear subspace
such that $\Xi\Delta$ is independent of $\Delta$ for $\Delta \in \Lambda$.

Suppose $\mathfrak{P}_0$ is a linear hypothesis, i.e. $\Theta_0 = \Theta \cap L$ where $L$ is a linear subspace
of $R^k$. Denoting the projection onto $L$ by $P$ one has

(3) $\frac{dP_\theta}{d\mu}(x) = a(\theta) b(x) e^{P\theta \cdot Pt(x)}$, $\theta \in \Theta_0$.

If aff $\Theta_0 = L$ then $Pt$ is minimal sufficient and, interpreting $P\theta$ and $Pt$ as vectors
of dimension $k_0 = \dim L$, (3) is a minimal representation of $\mathfrak{P}_0$.

Example 8.12. Let $x$ be an $r$-dimensional normal variate whose family of
distributions is of the affine type discussed in the previous example, i.e.
$\mathfrak{P}_0 = \{N_r(\xi, \Sigma) : (\xi, \Delta) \in \Xi \times \Lambda\}$ with $0 \in \Xi$ and $\mathfrak{P}_0$ is affine.

A minimal canonical statistic is then given by

$(P^{(1)}x, P^{(2)}(\frac{1}{2}x_1^2, \ldots, \frac{1}{2}x_r^2, x_1x_2, \ldots, x_ix_j, \ldots, x_{r-1}x_r))$

where $P^{(1)}$ is the projection in $R^r$ onto $\Xi$ and $P^{(2)}$ is the projection in $R^{\binom{r+1}{2}}$ onto
the linear subspace parallel to aff $\Lambda$. ►

(ii) Product families. For any $n \in N$, the product family $\mathfrak{P}^{(n)}$ is exponential with
ord $\mathfrak{P}^{(n)}$ = ord $\mathfrak{P}$, minimal canonical parametrization $\{P_\theta^{(n)} : \theta \in \Theta\}$ and minimal
standard representation

$\frac{dP_\theta^{(n)}}{dP_0^{(n)}}(x^{(n)}) = a(\theta)^n e^{\theta \cdot (t(x_1) + \cdots + t(x_n))}$

where $x^{(n)} = (x_1, \ldots, x_n)$. Thus, for a sample $x_1, \ldots, x_n$, a minimal canonical (and
minimal sufficient) statistic is given by $t(x_1) + \cdots + t(x_n)$.

The family $\mathfrak{P}^{(n)}$ is full if and only if $\mathfrak{P}$ is full.


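(iii) Marginality. The family $\mathfrak{P}t$ of marginal distributions of $t$ is exponential with
ord $(\mathfrak{P}t)$ = ord $\mathfrak{P}$ and minimal standard representation

$\frac{dP_\theta t}{dP_0 t} = a(\theta) e^{\theta \cdot t}$.

The restriction of $\mathfrak{P}$ to the $\sigma$-field generated by the component $t^{(1)}$ of $t$ has
minimal representation

(4) $\frac{dP_\theta t^{(1)}}{dP_0 t^{(1)}} = a(\theta) e^{\theta^{(1)} \cdot t^{(1)}} E_0(e^{\theta^{(2)} \cdot t^{(2)}} | t^{(1)})$

(cf. equation (4) of Section 1.1). This family, and hence the family $\mathfrak{P}t^{(1)}$ of marginal
distributions of $t^{(1)}$, is not in general exponential (although, of course, each of its
subfamilies determined by fixing the value of $\theta^{(2)}$ is exponential).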

Example 8.13. Let $u$ be a random variable following a Poisson distribution with
parameter $\lambda$ and suppose $y_1, y_2, \ldots$ are stochastically independent, mutually and
of $u$, each following the logarithmic distribution with point probabilities

$(-\ln(1 - \pi))^{-1} y^{-1} \pi^y$, $y = 1, 2, \ldots$.

Set

$v = y_1 + \cdots + y_u$.

The distribution of $t = (u, v)$ has point probabilities

$(1 - \pi)^\chi b(u, v) \chi^u \pi^v$;

here

$\chi = \lambda/\{-\ln(1 - \pi)\}$

while $b(u, v)$ is the coefficient of $z^v$ in the power series expansion of $(-\ln(1 - z))^u/u!$. Thus
the family $\mathfrak{P}$ of distributions of $t$ as $(\lambda, \pi)$ varies in $(0, \infty) \times (0, 1)$ is exponential (of
order 2 and regular). However, the marginal distribution of $v$ is the negative
binomial

$(1 - \pi)^\chi \binom{\chi + v - 1}{v} \pi^v$

(Jones and Mollison 1948) and hence $\mathfrak{P}v$ is not exponential. ►
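
The negative binomial marginal in Example 8.13 can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's derivation: it computes the distribution of a Poisson sum of logarithmic variates with the standard recursion for compound Poisson sums (often attributed to Panjer) and compares it with the negative binomial point probabilities with index chi = lam / (-ln(1 - pi)). The function names and the parameter values are hypothetical.

```python
import math


def compound_poisson_log_pmf(lam, pi, vmax):
    """P(v = 0), ..., P(v = vmax) for v = y_1 + ... + y_u.

    u is Poisson(lam) and the y_i are logarithmic(pi). Uses the recursion
    p_v = (lam / v) * sum_{j=1..v} j * f_j * p_{v-j}, with f_j the logarithmic pmf.
    """
    norm = -math.log(1.0 - pi)
    f = [0.0] + [pi ** j / (j * norm) for j in range(1, vmax + 1)]
    p = [math.exp(-lam)]
    for v in range(1, vmax + 1):
        p.append(lam / v * sum(j * f[j] * p[v - j] for j in range(1, v + 1)))
    return p


def negative_binomial_pmf(chi, pi, v):
    """(1 - pi)^chi * binom(chi + v - 1, v) * pi^v for real chi > 0."""
    log_binom = math.lgamma(chi + v) - math.lgamma(chi) - math.lgamma(v + 1)
    return math.exp(chi * math.log(1.0 - pi) + log_binom + v * math.log(pi))


lam, pi = 1.7, 0.4
chi = lam / (-math.log(1.0 - pi))
compound = compound_poisson_log_pmf(lam, pi, 10)
for v in range(5):
    # The two columns should agree up to rounding error.
    print(v, compound[v], negative_binomial_pmf(chi, pi, v))
```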

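It may be noted that if $t^{(1)}$ is a cut, then $\mathfrak{P}t^{(1)}$ is exponential. On the other hand,
if $\mathfrak{P}t^{(1)}$ is exponential of order $k^{(1)}$ and if $\mathfrak{P}$ is regular then $t^{(1)}$ is a cut. To see the
latter, fix $\theta^{(2)}$ at some value and denote the corresponding subfamily of $\mathfrak{P}$ by $\mathfrak{P}_{\theta^{(2)}}$.
On account of Theorem 8.5, $\mathfrak{P}_{\theta^{(2)}}$ is full and hence $\mathfrak{P}_{\theta^{(2)}}t^{(1)}$ must be full. Using
again Theorem 8.5, one finds that $\mathfrak{P}_{\theta^{(2)}}t^{(1)}$ is an affine subfamily of $\mathfrak{P}t^{(1)}$ and since
both are of order $k^{(1)}$ one has $\mathfrak{P}_{\theta^{(2)}}t^{(1)} = \mathfrak{P}t^{(1)}$ whatever the value of $\theta^{(2)}$. Thus $t^{(1)}$
is a cut.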

The densities of $\mathfrak{P}t$ with respect to a $\sigma$-finite measure $\mu$ (typically Lebesgue
measure or counting measure) are given by

(5) $p(t; \theta) = a(\theta) p(t) e^{\theta \cdot t}$

where

$p(\cdot) = \frac{dP_0 t}{d\mu}$.

When in concrete cases the densities are sought one will naturally look for a $P_0$
which makes the determination of $p(\cdot)$ relatively simple.

Example 8.14. Wishart density. In order to derive the density of the Wishart
distribution $W_r(f, \Sigma)$, i.e. the density for $t = x_1'x_1 + \cdots + x_f'x_f$ where $x_1, \ldots, x_f$ is
a sample from $N_r(0, \Sigma)$, one may conveniently proceed as follows. Letting $\mathfrak{P}$ be the
family of distributions of $(x_1, \ldots, x_f)$ and letting $P_0$ correspond to $\Sigma = I$ one
obtains from (18) of Section 8.1 and (5) that the density of $W_r(f, \Sigma)$ with respect to
Lebesgue measure on $R^{\binom{r+1}{2}}$ is

$|\Sigma|^{-f/2} e^{-\frac{1}{2} \mathrm{tr}\{(\Sigma^{-1} - I)t\}} p(t)$

so that the problem is reduced to that of finding $p(t)$, the density of $W_r(f, I)$. In
deriving the latter one may draw on the independence properties which hold
when $\Sigma = I$ (cf., for instance, pp. 597-598 in Rao (1973)). ►

Example 8.15. Resultant density. For a sample $v_1, \ldots, v_n$ from the von
Mises-Fisher distribution (equation (8) of Section 8.1) the density of the so-called
resultant $v_\cdot = v_1 + \cdots + v_n$ with respect to Lebesgue measure on $R^k$ is

$a(\kappa)^n e^{\theta \cdot v_\cdot} p(v_\cdot)$

where $\theta$ is the canonical parameter, $\kappa = |\theta|$, and $p$ is the density of $v_\cdot$ under the
uniform distribution on the unit sphere $S_k$.
Clearly, $p(v_\cdot)$ must be proportional to $|v_\cdot|^{-(k-1)}$ times the density of $|v_\cdot|$ under the
uniform distribution. An expression for this latter density in terms of the Bessel
function is available, cf. Mardia (1975). ►

More generally than (5) one has that for any statistic $u$ on $\mathfrak{X}$ the densities of $\mathfrak{P}u$
with respect to a dominating measure $\mu$ are determined by

(6) $p(u; \theta) = a(\theta) E_0(e^{\theta \cdot t} | u) p(u)$

where

$p(\cdot) = \frac{dP_0 u}{d\mu}$.

This is an immediate consequence of equation (4) of Section 1.1 and equation (1)
of the present section.

Example 8.16. Resultant length density. In the situation of Example 8.15 the
density of the resultant length $r = |v_\cdot|$ with respect to Lebesgue measure is

$a(\kappa)^n a(\kappa r)^{-1} p(r)$  $(0 < r < n)$,

with $\kappa = |\theta|$ and $p(r)$ being the density for the isotropic case $\kappa = 0$, which was mentioned in
Example 8.15. To see this all one has to note is that for $\kappa = 0$ the distribution of
$r^{-1}v_\cdot$ must be uniform on $S_k$, whence

$E_0(e^{\theta \cdot v_\cdot} | r) = a(\kappa r)^{-1}$. ►

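Example 8.17. Non-central $\chi^2$-density. Let $x_1, \ldots, x_n$ be independent and normally
distributed with mean values $\xi_1, \ldots, \xi_n$ and a common variance $\sigma^2$. To determine
the density (with respect to Lebesgue measure) for $u = x_1^2 + \cdots + x_n^2$, note first
that the density of $x_* = (x_1, \ldots, x_n)$ may be written

$\sigma^{-n} e^{-|\xi_*|^2/(2\sigma^2)} e^{\xi_* \cdot x_*/\sigma^2 - \frac{1}{2}(\sigma^{-2} - 1)u} \varphi(x_1) \cdots \varphi(x_n)$

where $\varphi$ denotes the density of $N(0, 1)$. Hence, it is possible to apply (6) with
$t = (x_*, u)$ and $P_0$ corresponding to $(\xi_*, \sigma^2) = (0, \ldots, 0, 1)$. Under $P_0$ the statistic $u$
follows the $\chi^2$-distribution with $n$ degrees of freedom and for general $(\xi_*, \sigma^2)$ the
density of $u$ is therefore

$\sigma^{-n} e^{-|\xi_*|^2/(2\sigma^2)} E_0(e^{\xi_* \cdot x_*/\sigma^2} | u) \, u^{n/2 - 1} e^{-u/(2\sigma^2)} / \{2^{n/2} \Gamma(n/2)\}$.

Under $P_0$ and given $u$, the statistic $u^{-1/2}x_*$ is uniformly distributed on the unit
sphere $S_n$ and so

$E_0(e^{\xi_* \cdot x_*/\sigma^2} | u) = a((|\xi_*|/\sigma^2)\sqrt{u})^{-1}$,

the latter quantity being the normalizing constant for the von Mises-Fisher
distribution on $S_n$ with $\kappa = (|\xi_*|/\sigma^2)\sqrt{u}$ (cf. Example 8.1). ►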

A further application of formula (6) may be found in Example 9.32. 




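(iv) Conditionality. On account of (5) in Section 1.1, one has

(7) $\frac{dP_\theta(\cdot | t^{(1)})}{dP_0(\cdot | t^{(1)})} = a(\theta^{(2)} | t^{(1)}) e^{\theta^{(2)} \cdot t^{(2)}}$

where

$a(\theta^{(2)} | t^{(1)}) = \{E_0(e^{\theta^{(2)} \cdot t^{(2)}} | t^{(1)})\}^{-1}$.

Since the conditional distribution of $t^{(2)}$ given $t^{(1)}$ depends on $\theta$ through $\theta^{(2)}$ only
one may write $P_{\theta^{(2)}}(\cdot | t^{(1)})$ instead of $P_\theta(\cdot | t^{(1)})$. Let $\Theta^{(2)}$ be the set of values of $\theta^{(2)}$ for $\theta$
varying in $\Theta$. It follows from (7) that for a fixed value $t^{(1)}$ the family

$\{P_{\theta^{(2)}}(\cdot | t^{(1)}) : \theta^{(2)} \in \Theta^{(2)}\}$

is exponential with $\theta^{(2)}$ and $t^{(2)}$ as canonical quantities.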

Example 8.18. Let $x_i$ follow the $N(\xi_i, \sigma^2)$ distribution, $i = 1, \ldots, k$, and suppose
$x_1, \ldots, x_k$ are independent. The minimal canonical statistic is $(x_*, |x_*|^2)$ and
$(\xi_*/\sigma^2, -1/(2\sigma^2))$ is the corresponding canonical parameter. Taking $t^{(1)} = |x_*|^2$
and observing that for $\xi_1 = \cdots = \xi_k = 0$ the conditional distribution of $t^{(2)} = x_*$
given $|x_*|^2 = 1$ is the uniform distribution on the unit sphere $S_k$ one sees at once
from (7) that the class of conditional distributions of $x_*$ given $|x_*| = 1$, for $\sigma^2 = 1$,
is equal to the family of von Mises-Fisher distributions on $S_k$ (cf. Example 8.1).

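Fullness of $\mathfrak{P}$ does not necessarily imply fullness of the family of conditional
distributions given $t^{(1)}$, as is easily shown by example.

If the mapping $\theta^{(2)} \to P_{\theta^{(2)}}(\cdot | t^{(1)})$ is one-to-one then $\theta^{(2)}$ is said to parametrize
the conditional distributions given $t^{(1)}$.

Lemma 8.4. Assume that the conditional distribution of $t^{(2)}$ given $t^{(1)}$ is
nonsingular.

Then $\theta^{(2)}$ parametrizes the conditional distributions given $t^{(1)}$.

Proof. Suppose

$P_{\theta^{(2)}}(\cdot | t^{(1)}) = P_{\tilde\theta^{(2)}}(\cdot | t^{(1)})$.

Then (cf. the definition of singular conditional distributions given in Section 1.2)
there exists a value $t^{(1)}$ such that under $P_0(\cdot | t^{(1)})$ the coordinates of $t^{(2)}$ are
affinely independent and

$a(\theta^{(2)} | t^{(1)}) e^{\theta^{(2)} \cdot t^{(2)}} = a(\tilde\theta^{(2)} | t^{(1)}) e^{\tilde\theta^{(2)} \cdot t^{(2)}}$

whence, by the remark preceding Example 8.1, $\theta^{(2)}$ must equal $\tilde\theta^{(2)}$. ►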


(v) Truncation and censoring. Let $A \in \mathfrak{A}$ and suppose that $0 < P_0(A) < 1$. Then the
family $\mathfrak{P}^A = \{P_\theta^A : \theta \in \Theta\}$, where $P_\theta^A$ is the truncation of $P_\theta$ to $A$, is exponential and
has standard representation

$\frac{dP_\theta^A}{dP_0^A} = a(\theta; A) e^{\theta \cdot t}$

where

$a(\theta; A)^{-1} = \frac{P_\theta(A)}{a(\theta)P_0(A)}$.

Obviously, the order of $\mathfrak{P}^A$ may be less than the order of $\mathfrak{P}$ since even though
$t_1, \ldots, t_k$ are affinely independent their restrictions to $A$ may well be affinely
dependent. Note, furthermore, that $\mathfrak{P}$ full does not necessarily imply $\mathfrak{P}^A$ full (even
if ord $\mathfrak{P}^A$ = ord $\mathfrak{P}$). However, if $\Theta = R^k$ then every truncation of $\mathfrak{P}$ is full.

Example 8.19. Doubly truncated normal family. The class $\mathfrak{N}$ of one-dimensional
normal distributions has minimal canonical parameter $(\xi/\sigma^2, 1/\sigma^2)$ and the
domain of this parameter is $R \times (0, \infty)$. Truncation to a finite interval yields a
non-full family. Indeed, since the range of the minimal canonical statistic $(x, x^2)$ is
bounded when $x$ is restricted to $(a, b)$, where $-\infty < a < b < \infty$, any minimal
canonical parameter domain for the full family generated by the truncated
distributions must equal $R^2$. The situation is quite similar if a whole sample
$x_1, \ldots, x_n$ of observations is considered. ►

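Let $P_\theta^{[A]}$ denote the censoring of $P_\theta$ to $A$. The family $\mathfrak{P}^{[A]} = \{P_\theta^{[A]} : \theta \in \Theta\}$ is
exponential of order $\le$ ord $\mathfrak{P}$ + 1 and

$\frac{dP_\theta^{[A]}}{dP_0^{[A]}} = a(\theta) e^{\theta \cdot t \, 1_A + \sigma(\theta) 1_{A^c}}$

where

$\sigma(\theta) = \ln \frac{P_\theta(A^c)}{a(\theta)P_0(A^c)}$.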

Example 8.20. Censored exponential lifetimes. Suppose an individual or item has
lifetime $t$ and that this lifetime is assumed to be exponentially distributed with
hazard rate parameter $\lambda$. If, as is rather often the case in practice, observation
consists in recording $t$ provided this value is less than or equal to some prefixed
time $t_0$ and otherwise noting merely that $t > t_0$, then one has an instance of
censoring of the exponential distribution to the interval $A = [0, t_0]$, and the
family of censored distributions is exponential of order 2. ►
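
One way to see the order 2 structure of Example 8.20 is to write the log density of a censored observation as a linear combination of two statistics. The following Python sketch does this under assumptions of its own: the statistics are taken to be min(t, t0) and the indicator that the lifetime was actually observed, with coefficients -lam and ln(lam), and the dominating measure is Lebesgue measure on [0, t0] plus a unit atom for the event t > t0. The function names are hypothetical.

```python
import math


def censored_log_density(t_obs, censored, lam):
    """Log density of one censored exponential observation.

    t_obs is min(t, t0); censored is True when only the event t > t0 was recorded.
    The value is -lam * t_obs + ln(lam) * 1{not censored}, i.e. linear in the two
    statistics with coefficients (-lam, ln lam), which trace a curve in the plane.
    """
    return -lam * t_obs + (0.0 if censored else math.log(lam))


def total_mass(lam, t0, steps=20000):
    """Midpoint-rule integral of the density over [0, t0] plus the atom at t > t0."""
    h = t0 / steps
    continuous = h * sum(math.exp(censored_log_density((k + 0.5) * h, False, lam))
                         for k in range(steps))
    atom = math.exp(censored_log_density(t0, True, lam))
    return continuous + atom


# The total mass should be close to 1 for any hazard rate and censoring time.
print(total_mass(0.7, 2.0))
```

The two coefficients are functionally related, so the censored family is a one-parameter subfamily of an exponential family of order 2 rather than a full family.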


(vi) Conjugate families. Suppose $\mathfrak{P}$ is full and let

(8) $a(\theta) b(x) e^{\theta \cdot t(x)}$

be a minimal canonical representation of $\mathfrak{P}$. Consider the conjugate family, i.e. the
family of distributions on $\Theta$ having densities with respect to Lebesgue measure

(9) $d(\gamma, \chi) a(\theta)^\gamma e^{\theta \cdot \chi}$

where $\gamma \in R$ and $\chi \in R^k$ are parameters and $d(\gamma, \chi)$ is a normalizing constant; the
domain of variation of $(\gamma, \chi)$ is the set

$\Xi = \{(\gamma, \chi) : \int_\Theta a(\theta)^\gamma e^{\theta \cdot \chi} \, d\theta < \infty\}$.

Note that the conjugate family is not determined by $\mathfrak{P}$ alone but depends on
the chosen minimal canonical representation. However, by transferring the
conjugate distributions to distributions on $\mathfrak{P}$ via the one-to-one mapping $\theta \to P_\theta$
one obtains a family $\mathfrak{P}^*$ of distributions which is the same whichever
representation is considered. Clearly, $\mathfrak{P}^*$ is a full exponential family.

From Theorem 6.1(i) and (iv) and from the relation int dom $\kappa^*$ = int $C$, which
follows from Theorem 9.1 later, one finds for the section $\Xi_\gamma$ of $\Xi$ at $\gamma$ the expressions

(10) $\Xi_\gamma = \gamma \, \mathrm{int}\, C$ for $\gamma > 0$
     $\Xi_\gamma = \mathrm{int\, bar}\, \Theta$ for $\gamma = 0$.

It is possible, in general, to have $\Xi_\gamma$ non-empty for some or all negative values of $\gamma$,
but no simple, general expression for $\Xi_\gamma$ is available for $\gamma < 0$. It follows, however,
from (10) that if $\Theta = R^k$ then $\Xi = \{(\gamma, \chi) : \gamma > 0, \chi \in \gamma \, \mathrm{int}\, C\}$. Note also that (10)
implies ord $\mathfrak{P}^*$ = $k + 1$.

Example 8.21. If $\mathfrak{P}$ is the family of one-dimensional normal distributions with
known variance then $\mathfrak{P}^*$ in effect equals $\mathfrak{N}$. ►


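Example 8.22. Let $\mathfrak{P}$ be the Poisson family having model function

$e^{-\lambda} \frac{\lambda^x}{x!}$.

The family of distributions of $\lambda$ induced by $\mathfrak{P}^*$ is the class of gamma
distributions

$\frac{\gamma^\chi}{\Gamma(\chi)} \lambda^{\chi - 1} e^{-\gamma\lambda}$, $(\chi, \gamma) \in (0, \infty)^2$. ►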

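Example 8.23. Take $\mathfrak{P}$ to be the family of multinomial distributions with model
function

$\frac{n!}{x_1! \cdots x_k! (n - x_\cdot)!} \pi_1^{x_1} \cdots \pi_k^{x_k} (1 - \pi_\cdot)^{n - x_\cdot}$.

Under $\mathfrak{P}^*$ the parameter $\pi = (\pi_1, \ldots, \pi_k)$ follows the family of Dirichlet
distributions

$\frac{\Gamma(\chi_1 + \cdots + \chi_{k+1})}{\Gamma(\chi_1) \cdots \Gamma(\chi_{k+1})} \pi_1^{\chi_1 - 1} \cdots \pi_k^{\chi_k - 1} (1 - \pi_\cdot)^{\chi_{k+1} - 1}$. ►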

Further examples may be found in Raiffa and Schlaifer (1961), and Ando and
Kaufman (1965).

It is simple to see that $(\mathfrak{P}^{(n)})^* = \mathfrak{P}^*$.

If $\theta$ is considered a random variable with density (9) and if $x_1, \ldots, x_n$ is a
random sample from $P_\theta$ then the probability function of the marginal distribution
of $x^{(n)} = (x_1, \ldots, x_n)$ is

(11) $p(x^{(n)}) = d(\gamma, \chi) \prod_{i=1}^{n} b(x_i) / d(\gamma + n, \chi + t)$

where $t = t(x_1) + \cdots + t(x_n)$. Furthermore, the probability function of $\theta$ given
$x^{(n)}$ is

(12) $p(\theta | x^{(n)}) = d(\gamma + n, \chi + t) a(\theta)^{\gamma + n} e^{\theta \cdot (\chi + t)}$.

In other words, (12) is the posterior distribution of $\theta$, with (9) as prior. One notes
that this posterior distribution, as well as the prior, belongs to the conjugate
family. However, the family of distributions (12) as $(\gamma, \chi)$ varies over $\Xi$ may
be a proper subset of the conjugate family.
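
The update from (9) to (12) amounts to adding the sample size and the value of the canonical statistic to the hyperparameters. The Python sketch below illustrates this for the Poisson family of Example 8.22, where (assuming the gamma form with shape chi and rate gamma on the scale of the Poisson mean) the posterior is again a gamma distribution. The function names and the data are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math


def posterior_hyperparameters(gamma, chi, sample, t=lambda x: x):
    """Map (gamma, chi) to (gamma + n, chi + t(x_1) + ... + t(x_n)), cf. (12)."""
    return gamma + len(sample), chi + sum(t(x) for x in sample)


def gamma_density(lam, shape, rate):
    """Gamma density rate^shape / Gamma(shape) * lam^(shape - 1) * exp(-rate * lam)."""
    return math.exp(shape * math.log(rate) - math.lgamma(shape)
                    + (shape - 1) * math.log(lam) - rate * lam)


def poisson_likelihood(lam, sample):
    """Joint Poisson point probability of the sample at mean lam."""
    return math.prod(math.exp(-lam) * lam ** x / math.factorial(x) for x in sample)


def prior_times_likelihood_over_posterior(lam, gamma, chi, sample):
    """Should not depend on lam if the posterior is the updated gamma density."""
    g_post, c_post = posterior_hyperparameters(gamma, chi, sample)
    return (gamma_density(lam, chi, gamma) * poisson_likelihood(lam, sample)
            / gamma_density(lam, c_post, g_post))


sample = [2, 0, 3, 1]
print(posterior_hyperparameters(1.5, 2.0, sample))
# The ratio below is the marginal probability of the sample, cf. (11), for every lam.
print([prior_times_likelihood_over_posterior(lam, 1.5, 2.0, sample)
       for lam in (0.5, 1.0, 2.5)])
```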


8.3 COMPLEMENTS 

(i) Let $\mathfrak{P}$ be exponential of order $k$ and with minimal standard representation as
equation (1) of Section 8.2, and let $u$ be a statistic. Suppose it is desired to find the
conditional mean value of $u$ given $t$, such as may be the case, for instance, in a
control of the model $\mathfrak{P}$. Owing to the sufficiency of $t$, this conditional mean value
does not depend on $\theta$ and may thus be denoted by $E(u|t)$.

A derivation of $E(u|t)$ via a determination of the conditional distribution of $u$
may seem too complicated. However, if the mean value $E_\theta u$ is a known function of
$\theta$, $E(u|t)$ can sometimes be obtained fairly simply by Laplace transform inversion
since

(1) $\int E(u|t) a(\theta) e^{\theta \cdot t} \, dP_0 = E_\theta u$.

(Recourse may of course be made to published lists of Laplace transforms and 
their inverses, such as Gradshteyn and Ryzhik (1965) and Roberts and Kaufman 
(1966).) 

Example 8.24. Assume that $x = (x_1, \ldots, x_n)$ where $x_1, \ldots, x_n$ are independent,
identically distributed Poisson variates. Here $t = x_\cdot$ and (1) takes the form

$\sum_{x_\cdot = 0}^{\infty} E(u | x_\cdot) e^{-n\lambda} \frac{(n\lambda)^{x_\cdot}}{x_\cdot!} = E_\lambda u$.

The conditional mean value $E(u | x_\cdot)$ may be found by expanding $\exp(n\lambda) E_\lambda u$ in a
power series in $\lambda$ and identifying coefficients. For instance, for $u = \Sigma(x_i - \bar{x})^2$ and
hence $E_\lambda u = (n - 1)\lambda$ one sees that

$E(\Sigma(x_i - \bar{x})^2 | x_\cdot) = \frac{n - 1}{n} x_\cdot$.

Thus

$E(s^2/\bar{x} | x_\cdot) = 1$, $x_\cdot > 0$,

where $s^2/\bar{x}$ is the dispersion index. The conditional variance of the dispersion
index is, by this technique, obtained as

$V(s^2/\bar{x} | x_\cdot) = 2(1 - x_\cdot^{-1})/(n - 1)$, $x_\cdot > 0$.

(Further formulas for conditional moments of $k$-statistics (sample cumulants) in
the Poisson distribution are given in Gart and Pettigrew (1970).)
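
The conditional moments in Example 8.24 can be confirmed by direct enumeration for small cases. The Python sketch below is a brute-force check, not the Laplace transform argument of the text: it relies on the standard fact that, given the total, a sample of independent Poisson variates with common mean is multinomial with equal cell probabilities. The function name is hypothetical, and exact rational arithmetic is used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def conditional_moments(n, total):
    """Exact conditional moments given x. = total (> 0) for n i.i.d. Poisson variates.

    Returns the conditional mean of sum (x_i - xbar)^2 and the conditional mean
    and variance of the dispersion index s^2 / xbar.
    """
    xbar = Fraction(total, n)
    mean_ss = mean_d = mean_d2 = Fraction(0)
    for x in product(range(total + 1), repeat=n):
        if sum(x) != total:
            continue
        coef = factorial(total)
        for xi in x:
            coef //= factorial(xi)          # multinomial coefficient, exact at each step
        prob = Fraction(coef, n ** total)   # equal cell probabilities 1/n
        ss = sum((xi - xbar) ** 2 for xi in x)
        d = ss / (n - 1) / xbar             # dispersion index s^2 / xbar
        mean_ss += prob * ss
        mean_d += prob * d
        mean_d2 += prob * d * d
    return mean_ss, mean_d, mean_d2 - mean_d ** 2


n, total = 3, 4
print(conditional_moments(n, total))
# Values given by the formulas of Example 8.24 for comparison.
print(Fraction((n - 1) * total, n), Fraction(1), 2 * (1 - Fraction(1, total)) / (n - 1))
```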


Many other examples may be drawn from the literature on unbiased
estimation, $E(u|t)$ being the minimum variance unbiased estimator of $E_\theta u$. See in
particular Washio, Morimoto, and Ikeda (1956).

Note that if $u$, $t$, and $P_0$ is any set of two statistics and a probability measure
and if $E_0(u|t)$ is sought then the above method is potentially applicable, through
introduction of the exponential family generated by $P_0$ and $t$.

The techniques described here were indicated by Tweedie (1946, 1947), though
his papers have been largely unnoticed.


(ii) Factorial series families. A parametrized family $Q = \{Q_\eta : \eta \in H\}$ with
$H \subset N_0^k$ is called a factorial series family if for $\eta \in H$ the support of $Q_\eta$ is contained
in $N_0^k$ and if there exist a positive function $g$ on $H$ and a non-negative function $b$ on
$N_0^k$ such that the probability of $x \in N_0^k$ under $Q_\eta$ is of the form

(2) $b(x)\eta^{(x)}/g(\eta)$

where $\eta^{(x)} = \eta_1^{(x_1)} \cdots \eta_k^{(x_k)}$ (the notation $n^{(m)}$ indicates the descending
factorial, i.e. $n^{(m)} = n(n - 1) \cdots (n - m + 1)$). With $i$ denoting a point in $N_0^k$, let
$S = \{i : b(i) > 0\}$ and $T_i = \{j \in N_0^k : 0 \le j_1 \le i_1, \ldots, 0 \le j_k \le i_k\}$. Note that the
support $S_\eta$ of $Q_\eta$ in general varies with $\eta$ and that $S_\eta = T_\eta \cap S$. For simplicity, suppose
that there exists a $k$-dimensional cube of unit side length whose vertices all belong
to $S$, and that the family $Q$ is full in the sense that $H = \{\eta \in N_0^k : T_\eta \cap S \ne \emptyset\}$.
Defining $g(\eta)$ for all $\eta \in N_0^k$ by $g(\eta) = \Sigma b(x)\eta^{(x)}$, one has

$b(x) = \Delta^x g(0)/x!$

and

$E_\eta x^{(m)} = \eta^{(m)} \nabla^m g(\eta)/g(\eta)$

where $\nabla$ and $\Delta$ denote, respectively, the descending and ascending difference
operator (of dimension $k$).

The factorial series families were introduced by Berg (1974, 1977), who noted
their similarity to power series families, studied some of their properties, and
showed how they arise as the result of certain sampling procedures employed to
obtain data for inference on the sizes of various classes of elements or individuals,
the parameters $\eta_1, \ldots, \eta_k$ denoting these sizes.

Example 8.25. Multivariate hypergeometric distributions. The probability function
of the multivariate hypergeometric distribution with indices $m$ and $n \le m$ (where
$m, n \in N$) and size parameters $\eta_1, \ldots, \eta_k$ is given by

$\binom{\eta_1}{x_1} \cdots \binom{\eta_k}{x_k} \binom{m}{n - x_\cdot} \Big/ \binom{\eta_\cdot + m}{n}$

and this is of the form (2) with

$g(\eta) = \binom{\eta_\cdot + m}{n}$. ►


Example 8.26. Binomial distributions. For a fixed value $\pi$ of the probability
parameter, the family of binomial distributions with trial parameter $n \in N_0$
is a full factorial series family for which $g(n) = (1 - \pi)^{-n}$. ►


Example 8.27. Matching. Suppose $n$ balls marked $1, 2, \ldots, n$ are distributed at
random among $n$ urns, also numbered $1, 2, \ldots, n$, in such a way that each urn
contains exactly one ball. The probability of observing $x$ non-matches is

$\frac{n^{(x)}}{n!} \sum_{i=0}^{x} \frac{(-1)^i}{i!}$

which is of the form (2). ►
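
For small n the matching probabilities of Example 8.27 can be checked by listing all arrangements. The following Python sketch compares a closed form, written as a factorial series term with g(n) = n!, against brute-force enumeration. The closed form used here is the standard derangement-based expression and is an assumption as far as the text's own display is concerned. The function names are hypothetical, and balls and urns are indexed from 0.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def nonmatch_pmf_formula(n, x):
    """n^(x) / n! * sum_{i=0..x} (-1)^i / i!, with n^(x) the descending factorial."""
    descending = Fraction(factorial(n), factorial(n - x))
    b = sum(Fraction((-1) ** i, factorial(i)) for i in range(x + 1))
    return descending * b / factorial(n)


def nonmatch_pmf_enumerated(n, x):
    """Same probability by enumerating all n! equally likely arrangements."""
    hits = sum(1 for perm in permutations(range(n))
               if sum(1 for urn, ball in enumerate(perm) if ball != urn) == x)
    return Fraction(hits, factorial(n))


n = 5
for x in range(n + 1):
    # The two columns should be identical fractions.
    print(x, nonmatch_pmf_formula(n, x), nonmatch_pmf_enumerated(n, x))
```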

Example 8.28. Multiple-recapture census. On $l$ successive occasions a random
sample is removed from an animal population and each time the individuals
removed are furnished with a tag, unless they have already been tagged at one of
the previous occasions, and are then returned to the population. Here the tags are
supposed to be identical. Let $n_i$ and $u_i$ denote, respectively, the sample size and the
number of unmarked individuals in the sample at the ith occasion, and suppose
$n_1, \ldots, n_l$ can be considered given. Then $u = u_{\cdot}$ is a minimal sufficient
statistic with respect to the size $\eta$ of the population and the probability of u is

$$\eta^{(u)} \left\{ \sum_{j=0}^{u} \frac{(-1)^j}{j!\,(u-j)!} \prod_{i=1}^{l} \binom{u-j}{n_i} \right\} \bigg/ \prod_{i=1}^{l} \binom{\eta}{n_i},$$

cf. Berg (1974). Thus u follows a distribution from a factorial series family with
$g(\eta) = \prod_{i=1}^{l} \binom{\eta}{n_i}$. ►
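For a small population the distribution of the total number u of distinct individuals caught can be enumerated directly. The sketch below compares the enumeration with an inclusion-exclusion expression, $\binom{\eta}{u} \sum_j (-1)^j \binom{u}{j} \prod_i \binom{u-j}{n_i} / \prod_i \binom{\eta}{n_i}$; this expression is an assumption standing in for the displayed formula, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def union_size_distribution(eta, sample_sizes):
    """Exact distribution of the number of distinct individuals caught."""
    all_samples = [list(combinations(range(eta), n)) for n in sample_sizes]
    counts, total = {}, 0
    for samples in product(*all_samples):
        u = len(set().union(*samples))
        counts[u] = counts.get(u, 0) + 1
        total += 1
    return {u: Fraction(c, total) for u, c in counts.items()}


def closed_form(u, eta, sample_sizes):
    """Assumed inclusion-exclusion expression for the probability of u."""
    inner = 0
    for j in range(u + 1):
        term = (-1) ** j * comb(u, j)
        for n in sample_sizes:
            term *= comb(u - j, n)  # comb returns 0 when n > u - j
        inner += term
    g = 1
    for n in sample_sizes:
        g *= comb(eta, n)           # g(eta) = prod_i C(eta, n_i)
    return Fraction(comb(eta, u) * inner, g)
```

The denominator computed in `closed_form` is the function $g(\eta)$ named in the example; the remaining factor depends on $\eta$ only through the descending factorial, which is what membership of a factorial series family requires.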
If the function g in (2) depends on $\eta$ through $\eta_{\cdot} = \eta_1 + \cdots + \eta_k$ only, then the
factorial series family is said to be sum-symmetric. The multivariate
hypergeometric family of Example 8.25 is an instance of this.

(iii) Infinite divisibility. Let $\mathfrak{P}$ be a full and linear exponential family of
distributions on $R^k$ and suppose ord $\mathfrak{P} = k$.

If one member P of $\mathfrak{P}$ is infinitely divisible then so is every other member.

To see this, let $P_n$ be the probability measure on $R^k$ whose nth convolution
equals P and let $\mathfrak{P}_n$ be the full exponential family generated by $P_n$ and the identity
mapping x on $R^k$. The family obtained by convoluting each member of $\mathfrak{P}_n$ with
itself n times equals $\mathfrak{P}$, cf. Section 8.2(ii), and this establishes the result.


8.4 NOTES 

Much of the material in this chapter belongs to the folklore of statistics. Previous 
accounts of fundamental properties of exponential families may be found in 
Lehmann (1959), Witting (1966), Chentsov (1966, 1972), and Barndorff-Nielsen 
(1970). The concepts of regular exponential families and mixed parametrization 
were introduced in Barndorff-Nielsen (1970), and Theorem 8.3 is from Barndorff-Nielsen
(1969). For discussions of the role of conjugate families (cf. Section 8.2(vi))
in Bayes statistics, see Raiffa and Schlaifer (1961) and Lindley (1971). (In 
statistical mechanics and certain probability theoretical contexts, the term 
conjugate family is used in a sense different from that given in Section 8.2(vi), 
namely to denote the exponential family generated by a probability measure $P_0$
and a random variable, cf. Section 8.1. See, for instance, Keilson (1965), Feller 
(1966), and references mentioned there; Feller employs the word associated 
instead of conjugate.) 

As mentioned in Section 1.2, the notion of exponential families first occurred in 
Fisher (1934). Fisher argued (in his characteristic, mathematically somewhat
imprecise, manner) that families of (one-dimensional) distributions which
admit a sufficient reduction have to be exponential, and he also indicated that 
these families were the only ones supplying uniformly most powerful tests in the 
sense of Neyman and Pearson. Many subsequent papers, by other authors, have 
been concerned with the mathematical questions left open by Fisher in making 
the former of the two claims. The first of these papers were by Darmois (1935), 
Koopman (1936), and Pitman (1936) (and exponential families have sometimes 
been referred to as Darmois-Koopman or Fisher-Darmois-Koopman-Pitman 
families.) For discrete type families no mathematically and statistically satisfac- 
tory formulation has been found, whereas a fairly adequate discussion can be 
given in the continuous type case, see Dynkin (1951), Brown (1964), and Hipp 
(1974) (generalization to higher dimensions is considered in Barndorff-Nielsen 
and Pedersen (1968)). The second of Fisher’s claims has been treated fairly 
recently by Pfanzagl (1968); see also the comments by Neyman and Pearson
(1936) and Bartlett (1937). 

Exponential families, or at least certain types of such families, are also arrived 
at from various starting points other than those of sufficient reduction and 
uniformly most powerful test. Boltzmann’s law in statistical mechanics (see e.g. 
Khinchin 1949) is a result to this effect. Martin-Löf (1970, 1974a,b,c, 1975) has
proposed an adaptation and development of the reasoning connected with 
Boltzmann’s law for use in statistics. A related way of constructing families of 
distributions is that of maximum entropy or minimum discrimination infor- 
mation estimation; this in turn has been shown to be formally translatable into 
the construction, originating with Gauss, of distribution families from the 
requirement that the arithmetic mean of a specified statistic should be the 
maximum likelihood estimator, and the families yielded by these two methods are 
exponential (see Campbell 1970 and the references therein). Finally, it may be 
recalled that the Cramér-Rao lower bound for the variance of an unbiased
estimator is attained only for exponential families, see Barankin (1951), Chentsov 
(1972), and Wijsman (1973). (All the above-mentioned derivations of exponential 
families are, of course, subject to regularity conditions.) 

Various generalizations of exponential families are discussed in Crain (1974), 
Johansen (1977), Lauritzen (1975), Nelder and Wedderburn (1972), and Soler 
(1977). 




CHAPTER 9 

Duality and Exponential Families 


For exponential families, sample-hypothesis duality and lods function theory 
(Chapter 3) combine intimately with the mathematical theory of convex duality 
(Chapters 5 and 6) and as a result it is possible, employing those theories, to 
establish a considerable number of statistically useful, general properties of 
exponential families, in a unified way. 

Throughout the present chapter the sample space $\mathfrak{X}$ is, for convenience, taken
to be the whole Euclidean space $R^k$, the exponential family $\mathfrak{P}$ is assumed to be of
order k and linear, and only minimal representations of $\mathfrak{P}$ are considered, unless
explicitly stated otherwise. With the canonical statistic t being the identity
mapping on $\mathfrak{X} = R^k$, the representations are thus of the form

(1) $p(t;\theta) = a(\theta)\, b(t)\, e^{\theta \cdot t}$ $(t \in R^k)$.

(The motivation for using t, and not x, to denote sample points and the identity
mapping on the sample space is that in applications of the results of this chapter
the family $\mathfrak{P}$ will often have been arrived at by margining to the minimal sufficient
and canonical statistic of an underlying exponential family.) The dominating
measure with respect to which the densities (1) are derived will, as usual, be
denoted by $\mu$. Further, it will be supposed that $b(t) = 0$ for $t \notin S$ (the support of $\mathfrak{P}$),
and that $0 \in \Theta$ and $a(0) = 1$. Thus $b(t) = p(t;0)$ and this function will also
occasionally be denoted by $p(t)$. Set $\varphi(t) = -\ln b(t) = -\ln p(t)$.

The basic parts of the convex duality theory for exponential families are 
presented in Section 9.1. 

Section 9.2 contains some results on stochastic independence and likelihood 
independence in exponential families. 

The lods functions to be studied, in Sections 9.3-9.6, are the log-likelihood
function

$l(\theta) = l(\theta; t) = \theta \cdot t - \kappa(\theta) - \delta(\theta \mid \Theta)$ $(\theta \in R^k)$

(where $\kappa = -\ln a$), the log-probability function

$\theta \cdot t - \varphi(t)$ $(t \in R^k)$

(where $\varphi = -\ln b$), and the log-plausibility function

$\pi(\theta) = \pi(\theta; t) = \theta \cdot t - \varphi^*(\theta) - \delta(\theta \mid \Theta)$ $(\theta \in R^k)$

(where $\varphi^*$ is the conjugate of $\varphi$).

Note that both $\theta \cdot t - \kappa(\theta)$ and $\theta \cdot t - \varphi^*(\theta)$ are closed concave functions on $R^k$,
the first because $\kappa$ is the logarithm of a Laplace transform (cf. Theorem 7.1) and
the second due to the fact that the conjugation operation always yields closed
convex functions (see the beginning of Section 5.3). It will be shown in Section 9.5
that $\theta \cdot t - \varphi(t)$ coincides on dom $\varphi$ with a concave function if and only if $\mathfrak{P}$ is
universal.

(For the study of the likelihood functions of an exponential family the
assumptions made here obviously cause no loss of generality. It may also be
noted that even for a nonlinear exponential family with representation
$a(\theta) b(x) \exp\{\theta \cdot t(x)\}$ the function $\varphi^*(\theta) = \sup_{x \in \mathfrak{X}} \{\theta \cdot t(x) + \ln b(x)\}$ is closed
convex.)

Prediction functions under exponential models are briefly discussed in Section 
9.7. 


9.1 CONVEX DUALITY AND EXPONENTIAL FAMILIES 

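In this section $\mathfrak{P}$ is assumed to be full. Recall the notations $C = \mathrm{cl\,conv}\, S$ and
$\mathcal{T} = \tau(\mathrm{int}\, \Theta)$. The cumulant transform $\kappa$ is a closed convex function and the
conjugate of $\kappa$ is the ‘sup-log-likelihood function’, to be denoted by $\hat{l}$. Thus

$$\kappa^*(t) = \sup_{\theta} \{\theta \cdot t - \kappa(\theta)\} = \sup_{\theta} l(\theta; t) = \hat{l}(t), \qquad t \in R^k.$$

Similarly, $\varphi^*$ is closed convex and its conjugate is, under mild regularity
conditions, equal to the ‘sup-log-plausibility function’ $\hat{\pi}$, i.e.

$$\varphi^{**}(t) = \sup_{\theta} \{\theta \cdot t - \varphi^*(\theta)\} = \sup_{\theta} \pi(\theta; t) = \hat{\pi}(t), \qquad t \in R^k.$$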
d e 

The discussion in the present section centres around the two pairs of functions $\kappa$
and $\hat{l}$, $\varphi^*$ and $\hat{\pi}$, and the pair of sets $\Theta$ and C. Each of these pairs is investigated for
statistically interesting properties, under the viewpoint of convex duality. The
results derived will be drawn upon in subsequent sections.

It may be noted that, letting $q_L$ and $q_\Pi$ denote the likelihood and plausibility
ratio test statistics for the hypothesis $\theta = 0$, one has $\hat{l} = -\ln q_L$ and
$\hat{\pi} = -\ln q_\Pi + d$ where d is a constant ($d = -\ln \sup_t b(t)$). Thus $\hat{l}$ and $\hat{\pi}$ both
have an immediate statistical interpretation.

The ‘likelihood pair’ $\kappa$ and $\hat{l}$ will be studied first.

Theorem 9.1. One has

(i) $\kappa^* = \hat{l}$ and $\hat{l}^* = \kappa$.

(ii) $\kappa$ is a closed and strictly convex function with dom $\kappa = \Theta$.

(ii)* $\hat{l}$ is a closed and essentially smooth convex function with int $C \subset$ dom $\hat{l} \subset C$.
Proof. Assertion (ii) is just a reiteration of Theorem 7.1.

The equality $\kappa^* = \hat{l}$ was pointed out above, and since $\kappa$ is closed $\hat{l}^* = \kappa^{**} = \kappa$.
The essential smoothness of $\hat{l}$ is a consequence of Theorem 5.30.

To prove dom $\hat{l} \subset C$, consider a point $t \notin C$. Let H be a hyperplane separating
C and t strongly, and let e be the unit vector in $R^k$ which is normal to H and such
that C lies in the negative halfspace determined by H and e. Setting $d = e \cdot t$ one
has $d > \delta^*(e \mid C)$ and hence, by (5) of Section 7.1,

$$l(re; t) = rd - \ln c(re) \to \infty \quad \text{as } r \to \infty.$$

Consequently, $\hat{l}(t) = \infty$ and $t \notin$ dom $\hat{l}$.

If t is a point in int C then $t \in$ dom $\hat{l}$, i.e. $\theta \cdot t - \kappa(\theta)$, considered as a function of $\theta$,
is bounded above. This follows immediately from the next two lemmas. ►

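Lemma 9.1. For any $\theta \in R^k$, $t \in R^k$

$$\theta \cdot t - \kappa(\theta) \le -\ln \rho(t)$$

where

$$\rho(t) = \inf_{e} P_0\{e \cdot T \ge e \cdot t\},$$

the infimum being taken over all unit vectors e in $R^k$. Hence $t \in$ dom $\hat{l}$ if $\rho(t) > 0$.

The proof of this lemma is simple and will not be given. ►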
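The inequality of Lemma 9.1 is easy to illustrate numerically in dimension one, where the only unit vectors are $+1$ and $-1$, so that the infimum reduces to $\min\{P_0(T \ge t), P_0(T \le t)\}$. The sketch below is an illustration under assumed inputs only: the generating measure is taken to be a binomial distribution with 10 trials and success probability 0.3, and the grids of $\theta$ and t values are arbitrary choices.

```python
from math import comb, exp, log

# Assumed generating measure P_0: binomial(10, 0.3) on the support 0, ..., 10.
support = list(range(11))
weights = [comb(10, s) * 0.3**s * 0.7 ** (10 - s) for s in support]


def kappa(theta):
    """Cumulant transform: logarithm of the Laplace transform of P_0."""
    return log(sum(w * exp(theta * s) for s, w in zip(support, weights)))


def rho(t):
    """In one dimension: the smaller of the two closed tail probabilities at t."""
    upper = sum(w for s, w in zip(support, weights) if s >= t)
    lower = sum(w for s, w in zip(support, weights) if s <= t)
    return min(upper, lower)


def lemma_gap(theta, t):
    """-ln rho(t) - (theta*t - kappa(theta)); nonnegative if the lemma holds."""
    return -log(rho(t)) - (theta * t - kappa(theta))


# Smallest gap over a grid of theta in [-8, 8] and t in (0, 10).
worst_gap = min(
    lemma_gap(th / 10, t / 4) for th in range(-80, 81) for t in range(1, 40)
)
```

On this grid `worst_gap` should be nonnegative up to rounding, in agreement with the bound $e^{\kappa(\theta)} \ge e^{\theta t} P_0\{\theta T \ge \theta t\}$ from which the lemma follows.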

Lemma 9.2. Suppose $t \in$ int C. Then $\inf P_0\{e \cdot T \ge e \cdot t\} > 0$, where the infimum is
taken over all unit vectors e in $R^k$.

Proof. By Theorem 5.3 there exists a natural number $m \le 2k$ and m points
$s_1, \ldots, s_m$ in S such that $t \in$ int conv$\{s_1, \ldots, s_m\}$. Let H(e) denote the hyperplane
through t with normal e, let $\delta_i(e)$ be the distance from $s_i$ to H(e) and set
$\delta(e) = \max\{\delta_1(e), \ldots, \delta_m(e)\}$. The mapping $\delta: e \to \delta(e)$ defined on the set of unit
vectors is a continuous mapping on a compact set; thus it attains its infimum $\delta_0$,
say, and clearly $\delta_0$ must be positive. Consequently, for every e the closed positive
halfspace determined by H(e) and e contains at least one of the spheres $B(s_i, \delta_0)$
with centre $s_i$ and radius $\delta_0$, $i = 1, 2, \ldots, m$, and each of these spheres has positive
$P_0$ measure since $s_i \in S$. Hence

$$\inf_{e} P_0\{e \cdot T \ge e \cdot t\} \ge \min\{P_0(B(s_i, \delta_0)): i = 1, \ldots, m\} > 0.$$ ►

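Note that if $\theta$ and t satisfy

(1) $\theta = D\hat{l}(t)$ or, equivalently, $t \in \partial\kappa(\theta)$

then (cf. (2) of Section 5.4)

$\kappa(\theta) + \hat{l}(t) = \theta \cdot t.$

Condition (1) is fulfilled, in particular, if

$\theta \in \mathrm{int}\, \Theta, \quad t = \tau(\theta).$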
The family of these distributions, for $(\lambda_1, \lambda_2)$ varying in $(0, \infty)^2$, is regular
exponential, with $t(x) = (\ln x, \ln(1 - x))$ as minimal canonical statistic and

$$C = \{t: t_1 < 0,\ t_2 \le \ln(1 - e^{t_1})\}.$$

Moreover, the mean value of t is

(4) $\tau = (\psi(\lambda_1) - \psi(\lambda_1 + \lambda_2),\ \psi(\lambda_2) - \psi(\lambda_1 + \lambda_2))$

where $\psi$ denotes the digamma function. It follows from Theorems 9.2 and 9.3
that the range of $\tau$ equals int C. (A direct proof of this from (4) would not be
trivial.) ►

By Theorem 9.1(ii)* dom $\hat{l} \subset C$ and int(dom $\hat{l}$) = int C. The results mentioned
next throw light on the question of when a boundary point of C belongs to dom $\hat{l}$.

Theorem 9.4. Suppose S is a finite or countable set. Then conv $S \subset$ dom $\hat{l}$.

Hence, in particular, dom $\hat{l} = C$ if S is finite.

Proof. Any point $t \in$ conv S can be written as a convex combination of $k + 1$
points $s_1, \ldots, s_{k+1}$ from S (Theorem 5.3) and every closed halfspace H, whose
corresponding hyperplane contains t, must contain at least one of the points
$s_1, \ldots, s_{k+1}$. Thus

$$\rho(t) = \inf_{e} P_0\{e \cdot T \ge e \cdot t\} \ge \min\{P_0\{s_i\}: i = 1, \ldots, k + 1\} > 0$$

and hence, by Lemma 9.1, $t \in$ dom $\hat{l}$.

When S is finite, C = conv S which implies dom $\hat{l} = C$. ►

Theorem 9.5. Let t be a boundary point of C. If there exists a hyperplane H
supporting C at t and satisfying $P_0(H) = 0$ then $t \notin$ dom $\hat{l}$.

Thus, in particular, dom $\hat{l}$ = int C if $P_0$ is absolutely continuous with respect to
Lebesgue measure.

Proof. This result follows easily from formula (5) of Section 7.1. ►

Next, the ‘plausibility pair’ $\varphi^*$ and $\hat{\pi}$ will be discussed. For this discussion $\mathfrak{P}$ is
supposed to be of either discrete or continuous type, and in the latter case the
probability functions $p(\cdot\,; \theta)$, $\theta \in \Theta$, are assumed to be densities with respect to
Lebesgue measure.

Note that, by formula (2) of Section 5.3,

$\varphi^* = (\mathrm{conv}\, \varphi)^*$ and $\varphi^{**} = \mathrm{cl\,conv}\, \varphi$,

and that if S is finite then conv $\varphi$ is closed because it is finitely generated, cf.
Rockafellar (1970), Corollary 19.1.2.

One finds in partial analogy to Theorem 9.1:

Theorem 9.6. Suppose that $\sup_t p(t; \theta) < \infty$ for every $\theta \in \Theta$, and that either S is
finite or a subset of $Z^k$ or $\mathfrak{P}$ is of continuous type. Then

(i) $\varphi^{**} = \hat{\pi}$ and $\hat{\pi}^* = \varphi^*$.

(ii) $\varphi^*$ is a closed convex function with $\Theta \subset$ dom $\varphi^* \subset$ cl $\Theta$.

(ii)* $\hat{\pi}$ is a closed convex function with $S \subset$ dom $\hat{\pi} \subset C$.

Proof. That the domain of the closed convex function $\varphi^*$ includes $\Theta$ is the same as
the first of the stated suppositions. To prove dom $\varphi^* \subset$ cl $\Theta$ (which is trivial for S
finite), suppose $\theta \notin \Theta$, then

$$\int e^{\theta \cdot t - \varphi(t)}\, d\mu = \infty$$

and, since $\varphi^{**} = \mathrm{cl\,conv}\, \varphi \le \varphi$, this implies

$$\int e^{\theta \cdot t - \varphi^{**}(t)}\, d\mu = \infty.$$

In the continuous type case one may conclude from this, on account of Theorem
6.1(i) and (iv), that $\theta \notin$ int dom $\varphi^*$. The same conclusion is reachable for $S \subset Z^k$ by
invoking Theorems 6.2 and 6.1. This establishes assertion (ii), and (i) and (ii)*
follow simply. ►

Theorem 9.7. Suppose conv $\varphi$ is closed and let $\mathfrak{P}$ be of either c-discrete or
continuous type.

If $\mathfrak{P}$ is strongly unimodal then

$\Theta = \mathrm{int\,dom}\, \varphi^*$

and hence $\mathfrak{P}$ is regular.

Proof. Apply, again, Theorems 6.1 and 6.2. ►

By the same reasoning as that yielding int dom $\varphi^* = \Theta$ one finds, more
generally, that if $\mathfrak{P}$ is c-discrete and if $\bar{\varphi}$ is any closed convex function which
coincides with $\varphi$ on S (whence $\mathfrak{P}$ is strongly unimodal) and for which
$S = Z^k \cap$ dom $\bar{\varphi}$ then

(5) $\Theta = \mathrm{int\,dom}\, \bar{\varphi}^*$.

Theorem 9.8. Suppose $\mathfrak{P}$ is of continuous type and that $\varphi$ is a closed convex function
(whence $\mathfrak{P}$ is strongly unimodal and regular).

If the convexity of $\varphi$ is strict then $\varphi^*$ is differentiable on $\Theta$ and steep.

If $\varphi$ is differentiable on int C and steep then $\varphi^*$ is essentially strictly convex.

Proof. Use Theorem 5.30. ►

Further results on $\kappa$, $\hat{l}$, $\varphi^*$ and $\hat{\pi}$ are given in the following sections.

The theorem below gives some information on the relation between $\Theta$ and C.

Theorem 9.9. Let $P_0$ be a probability measure on $\mathfrak{X}$ and let t be a k-dimensional
statistic. Furthermore, let $\mathfrak{P}$ denote the exponential family generated by $P_0$ and t,
let C be the convex support of $P_0 t$, and set

$$\Theta = \Big\{\theta: \int e^{\theta \cdot t}\, dP_0 < \infty\Big\}.$$

Then

(6) bar $\Theta \cap$ bar $C = \{0\}$.

Furthermore

(7) bar $C \subset 0^+\Theta$

and, dually,

(8) bar $\Theta \subset 0^+C$.

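Proof. Suppose e is a unit vector contained in bar $\Theta$. Then there exists a positive
number r such that

$$\int e^{re \cdot t}\, dP_0 = \infty.$$

But this is possible only if the set $\{e \cdot t: t \in C\}$ is unbounded above, i.e. $e \notin$ bar C.

To prove (7) suppose e is a unit vector in bar C and let $\theta_0$ be an arbitrary point
in $\Theta$. Then $\delta^*(e \mid C) < \infty$ and

$$\int e^{(\theta_0 + re) \cdot t}\, dP_0 \le e^{r \delta^*(e \mid C)} \int e^{\theta_0 \cdot t}\, dP_0 < \infty \quad \text{for all } r \ge 0.$$

Thus $e \in 0^+\Theta$.

From (7) one obtains

$$(\mathrm{bar}\, C)^{\circ} \supset (0^+\Theta)^{\circ}.$$

By (5) of Section 5.1, $(\mathrm{bar}\, C)^{\circ} = 0^+C$ and by (5), (4), and (3) of Section 5.1,

$$(0^+\Theta)^{\circ} \supset (0^+ \mathrm{cl}\, \Theta)^{\circ} = (\mathrm{bar\,cl}\, \Theta)^{\circ\circ} = \mathrm{cl\,bar\,cl}\, \Theta \supset \mathrm{bar}\, \Theta$$

whence (8) follows.

(If it had been assumed that ord $\mathfrak{P} = k$ then the theorem could be proved by
appealing to Theorems 5.19 and 9.1.) ►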

In many instances the inclusions in (7) and (8) may be replaced by equalities, or 
nearly so. That this is not always the case is shown by the third of the examples 
mentioned next. 




Example 9.3. If S is bounded, in particular finite, then bar $C = R^k = 0^+\Theta$ and
bar $\Theta = \{0\} = 0^+C$. ►

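Example 9.4. Gamma family. The density of the gamma distribution with shape
parameter $\lambda$ and scale parameter $\beta$ is

$$\frac{x^{\lambda - 1} e^{-x/\beta}}{\Gamma(\lambda)\, \beta^{\lambda}} \qquad (x > 0)$$

where $(\lambda, \beta) \in (0, \infty)^2$. The family constituted by these distributions is regular with
$t = (x, \ln x)$ and $\theta = (-\beta^{-1}, \lambda - 1)$ as a pair of minimal canonical variates. The
sets $\Theta$ and C are as indicated in Figures 9.1 and 9.2.

Here

$$\mathrm{bar}\, \Theta = \{\theta: \theta_1 \ge 0,\ \theta_2 \le 0\}$$
$$0^+\Theta = \{\theta: \theta_1 \le 0,\ \theta_2 \ge 0\}$$
$$\mathrm{bar}\, C = \{\tau: \tau_1 < 0,\ \tau_2 \ge 0\} \cup \{0\}$$
$$0^+C = \{\tau: \tau_1 \ge 0,\ \tau_2 \le 0\}$$

so that

$$\mathrm{bar}\, C \ne 0^+\Theta, \qquad \mathrm{cl\,bar}\, C = 0^+\Theta, \qquad \mathrm{bar}\, \Theta = 0^+C.$$ ►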


Example 9.5. Suppose $\mathfrak{P}$ is the family of those normal distributions on $R^k$ for
which the coordinate random variables are independent and have variance 1.
This family is full exponential with the identity mapping as minimal canonical
statistic, and the mean value $\theta$ of the distributions in $\mathfrak{P}$ affords a minimal canonical
parametrization of $\mathfrak{P}$ with $\Theta = R^k$. In this case $C = R^k$ and bar $\Theta = \{0\}$,
$0^+C = R^k$, bar $C = \{0\}$, $0^+\Theta = R^k$. ►

Corollary 9.1. Under the setup of Theorem 9.9, if dim C = k and int(bar C) $\ne \emptyset$
then ord $\mathfrak{P} = k$.

Proof. The condition dim C = k implies that $t_1, \ldots, t_k$ are affinely independent,
and int(bar C) $\ne \emptyset$ and (7) together imply int $\Theta \ne \emptyset$. The result now follows
from Corollary 8.1. ►
9.2 INDEPENDENCE AND EXPONENTIAL FAMILIES

In this section $\mathfrak{P}$ is not assumed to be linear, as is otherwise standard in this
chapter.

Theorem 9.10. Let $P_0$ be a probability measure, let t be a statistic and let $\mathfrak{P}$ be the
exponential family generated by $P_0$ and t. Furthermore, let $(t^{(1)}, \ldots, t^{(m)})$ be a
partition of t.

If $t^{(1)}, \ldots, t^{(m)}$ are independent under $P_0$ then they are independent under every
element of $\mathfrak{P}$.

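Proof. $\mathfrak{P}$ has the representation

$$\frac{dP_\theta}{dP_0} = a(\theta)\, e^{\theta \cdot t}$$

and, due to the independence assumption, one has

$$a(\theta) = \prod_{j=1}^{m} a_j(\theta^{(j)})$$

where

$$a_j(\theta^{(j)})^{-1} = \int e^{\theta^{(j)} \cdot t^{(j)}}\, dP_0.$$

Hence, if $A_j$ belongs to the $\sigma$-algebra generated by $t^{(j)}$ $(j = 1, \ldots, m)$,

$$P_\theta\Big(\bigcap_{j=1}^{m} A_j\Big) = \prod_{j=1}^{m} \int_{A_j} a_j(\theta^{(j)})\, e^{\theta^{(j)} \cdot t^{(j)}}\, dP_0 = \prod_{j=1}^{m} P_\theta(A_j).$$ ►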
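Theorem 9.10 can be illustrated on a small discrete example. The sketch below is a toy check, not part of the original argument: it assumes a generating measure $P_0$ that is a product of two arbitrary finite distributions, tilts it exponentially with a canonical parameter $(\theta^{(1)}, \theta^{(2)})$, and measures how far the tilted joint distribution is from the product of its marginals.

```python
from itertools import product
from math import exp

# Assumed marginal distributions of t^(1) and t^(2) under P_0 (independent).
p1 = {0: 0.2, 1: 0.5, 2: 0.3}
p2 = {0: 0.6, 3: 0.4}


def tilted_joint(theta1, theta2):
    """Joint distribution under P_theta, with density a(theta) exp(theta . t) w.r.t. P_0."""
    unnormalised = {
        (s1, s2): p1[s1] * p2[s2] * exp(theta1 * s1 + theta2 * s2)
        for s1, s2 in product(p1, p2)
    }
    total = sum(unnormalised.values())  # equals 1 / a(theta)
    return {point: w / total for point, w in unnormalised.items()}


def independence_defect(joint):
    """Largest absolute difference between the joint and the product of marginals."""
    m1 = {s1: sum(joint[(s1, s2)] for s2 in p2) for s1 in p1}
    m2 = {s2: sum(joint[(s1, s2)] for s1 in p1) for s2 in p2}
    return max(abs(joint[(s1, s2)] - m1[s1] * m2[s2]) for s1, s2 in joint)
```

For any parameter value, for instance `independence_defect(tilted_joint(0.7, -1.3))`, the defect should vanish up to rounding error, reflecting the factorization of $a(\theta)\, e^{\theta \cdot t}$.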

Corollary 9.2. Let $\mathfrak{P}$ be the exponential family generated by $P_0$ and t, and let
$(t^{(0)}, t^{(1)}, \ldots, t^{(m)})$ be a partition of t.

If $t^{(1)}, \ldots, t^{(m)}$ are conditionally independent given $t^{(0)}$ under $P_0$ then they have the
same property under every element of $\mathfrak{P}$.

Proof. For any value $t^{(0)}$, the conditional distribution of $(t^{(1)}, \ldots, t^{(m)})$ given $t^{(0)}$ and
under $P_\theta$ belongs to the exponential family generated by $P_0(\cdot \mid t^{(0)})$ and
$(t^{(1)}, \ldots, t^{(m)})$. ►

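Theorem 9.11. Suppose $\mathfrak{P}$ has open kernel, let t be a minimal canonical statistic for
$\mathfrak{P}$ and let $(t^{(1)}, \ldots, t^{(m)})$ be a partition of t.

If $t^{(1)}, \ldots, t^{(m)}$ are uncorrelated under $\mathfrak{P}$ then $t^{(1)}, \ldots, t^{(m)}$ are independent under
$\mathfrak{P}$.

Proof. It suffices to prove the result for m = 2 and, in view of Theorem 9.10, for $\Theta$ a
product set $\Theta = \Theta^{(1)} \times \Theta^{(2)}$ with $\Theta^{(1)}$ and $\Theta^{(2)}$ being regions and $\Theta$ containing 0.
From (12) and (13) of Section 8.1 one obtains

$$\frac{\partial \tau^{(1)}}{\partial \theta^{(2)}}(\theta) = 0, \qquad \theta \in \Theta,$$

whence

$$\tau^{(1)}(\theta^{(1)}, \theta^{(2)}) = \tau^{(1)}(\theta^{(1)}, 0), \qquad \theta \in \Theta,$$

which in turn implies

$$\kappa(\theta) = \kappa(\theta^{(1)}, 0) + \kappa(0, \theta^{(2)}).$$

So, the Laplace transform $c = \exp \kappa$ factorizes and $t^{(1)}$ and $t^{(2)}$ are therefore
independent under $P_0$, and hence under $\mathfrak{P}$.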

Lemma 9J, Let {P^icoeQ] bea parametrization of% let % be an element ofS^ and 
let 0 )^”*^) be a partition of co. 

If are L-independent then any minimal representation of the 

densities of ip with respect to % is of the form 

dP 

dn 

Proof This result is simple to obtain from the definition of L-independence and 
the affine independence of the coordinates of a minimal canonical statistic. ► 

The lemma is practical in a verification of the following assertion which is also
simple to show.

Let $\{P_\theta: \theta \in \Theta\}$ be a minimal parametrization of $\mathfrak{P}$, let $(\theta^{(1)}, \ldots, \theta^{(m)})$ and
$(t^{(1)}, \ldots, t^{(m)})$ be similar partitions of $\theta$ and t, and suppose $\mathfrak{P}$ is full. Then
$t^{(1)}, \ldots, t^{(m)}$ are (stochastically) independent if and only if $\theta^{(1)}, \ldots, \theta^{(m)}$ are
L-independent.

Thus L-independence of the components of a partition of a canonical
parameter is a rather trivial phenomenon. It will appear from Section 10.2 that
L-independence in exponential families is almost exclusively tied to L-independence
of the mean value component and the canonical component of mixed
parametrizations. The next theorem contains a criterion for this latter property.
Consider a minimal representation of ^ and suppose Sp is interior. 

Theorem 9.12. If $\tau^{(1)}$ and $\theta^{(2)}$ are L-independent then $\theta^{(1)}$, considered as a function of
$(\tau^{(1)}, \theta^{(2)})$, is of the form

(1) $\theta^{(1)} = \varphi(\tau^{(1)}) + \psi(\theta^{(2)})$

for some functions $\varphi$ and $\psi$.

The converse assertion is valid provided $\Theta$ is open and connected, and provided
$\tau^{(1)}$ and $\theta^{(2)}$ are variation independent.

Proof. The first assertion follows from Lemma 9.3, the second from Theorem 10.4. ►

Corollary 9.3. Suppose $\mathfrak{P}$ is regular. Then the following three conditions are
equivalent.

(i) $\tau^{(1)}$ and $\theta^{(2)}$ are L-independent

(ii) $\theta^{(1)}$ is of the form $\theta^{(1)} = \varphi(\tau^{(1)}) + \psi(\theta^{(2)})$

(iii) $\tau^{(1)}$ and $\theta^{(2)}$, considered as random variables on $\Theta$, are stochastically
independent under the conjugate family.

Proof. The equivalence of (i) and (ii) is obvious from Theorem 9.12 and Theorem
8.4.

To show (i) $\Rightarrow$ (iii) $\Rightarrow$ (ii) note first that, since $\mathfrak{P}$ is regular, the domain of
variation of $(\tau^{(1)}, \theta^{(2)})$ is of the form $\mathcal{T}^{(1)} \times \Theta^{(2)}$. Moreover, the density of $(\tau^{(1)}, \theta^{(2)})$,
with respect to Lebesgue measure on $\mathcal{T}^{(1)} \times \Theta^{(2)}$ and under the element of the
conjugate family given by (9) of Section 8.2, is


(2) 




39 ^^^ 


If $\tau^{(1)}$ and $\theta^{(2)}$ are L-independent then, by Lemma 9.3, this density may be
written




dtp 


i.e. the density factorizes into a function of $\tau^{(1)}$ times a function of $\theta^{(2)}$. Thus $\tau^{(1)}$
and $\theta^{(2)}$ are stochastically independent.

On the other hand, stochastic independence of $\tau^{(1)}$ and $\theta^{(2)}$ implies that (2)
factorizes and this has, as is simple to see, the consequence that $\theta^{(1)}$ is of the form

$\theta^{(1)} = \varphi(\tau^{(1)}) + \psi(\theta^{(2)})$. ►

Example 9.6. Let $x_1$ and $x_2$ be independent and Poisson distributed with mean
values $\lambda_1$ and $\lambda_2$. Then $t = (x_{\cdot}, x_2)$ and $\theta = (\ln \lambda_1, \ln\{\lambda_2/\lambda_1\})$ is a corresponding
pair of canonical variates, and $\tau^{(1)} = \lambda_{\cdot}$. Now

$$\theta^{(1)} = \ln \tau^{(1)} - \ln(1 + e^{\theta^{(2)}})$$

and Corollary 9.3 yields the two well-known results that $\lambda_{\cdot}$ and $\lambda_2/\lambda_1$ are
L-independent and that if $\lambda_1$ and $\lambda_2$ are stochastically independent and follow
gamma-distributions with a common scale parameter, as is the case under
the conjugate family, then $\lambda_{\cdot}$ and $\lambda_2/\lambda_1$ are also stochastically independent. ►
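The additive decomposition used in the Poisson example can be confirmed numerically. The sketch below assumes the parametrization $\theta^{(1)} = \ln \lambda_1$, $\theta^{(2)} = \ln(\lambda_2/\lambda_1)$ and $\tau^{(1)} = \lambda_1 + \lambda_2$ (the mean of $x_1 + x_2$), and checks that $\theta^{(1)} = \ln \tau^{(1)} - \ln(1 + e^{\theta^{(2)}})$; the function names are illustrative only.

```python
from math import exp, isclose, log


def canonical_and_mean(lam1, lam2):
    """Return (theta1, theta2, tau1) for two independent Poisson means."""
    theta1 = log(lam1)
    theta2 = log(lam2 / lam1)
    tau1 = lam1 + lam2  # mean value of t^(1) = x_1 + x_2
    return theta1, theta2, tau1


def theta1_from_mixed(tau1, theta2):
    """theta1 recovered from the mixed parameter (tau1, theta2)."""
    return log(tau1) - log(1 + exp(theta2))


def identity_holds(lam1, lam2):
    """Check the decomposition for one pair of Poisson means."""
    theta1, theta2, tau1 = canonical_and_mean(lam1, lam2)
    return isclose(theta1_from_mixed(tau1, theta2), theta1, rel_tol=1e-12, abs_tol=1e-12)
```

The right-hand side of `theta1_from_mixed` is a sum of a function of $\tau^{(1)}$ alone and a function of $\theta^{(2)}$ alone, which is the form required in condition (ii) of Corollary 9.3.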

Example 9.7. Independence in the Wishart family. Suppose t is a $W_r(f, \Sigma)$-distributed
variate so that the density of t is

(3) $w_r(f)\, |\Delta|^{f/2}\, |t|^{(f - r - 1)/2} \exp\{-\tfrac{1}{2} \mathrm{tr}(\Delta t)\}$

where $\Delta = \Sigma^{-1}$ (cf. Example 6.6). Partition t

$$t = \begin{pmatrix} t_{11} & t_{12} \\ t_{21} & t_{22} \end{pmatrix}$$

such that $t_{11}$ and $t_{22}$ are square matrices and set $t^{(1)} = t_{11}$. Correspondingly one
has $\tau^{(1)} = \Sigma_{11}$, $\theta^{(1)} = \Delta_{11}$ and

$$\theta^{(2)} = \begin{pmatrix} 0 & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix}.$$

The standard formula $\Sigma_{11}^{-1} = \Delta_{11} - \Delta_{12} \Delta_{22}^{-1} \Delta_{21}$ shows that condition (ii) of
Corollary 9.3 is satisfied and hence $\Sigma_{11}$ and $(\Delta_{12}, \Delta_{22})$ are L-independent and,
moreover, stochastically independent under the conjugate family.
It is evident from the expression (3) that the class of marginal distributions of $\Delta$
under the conjugate family includes all the Wishart distributions on $R^{\binom{r+1}{2}}$. Thus the important
and well-known fact that $t_{11} - t_{12} t_{22}^{-1} t_{21}$ and $(t_{12}, t_{22})$ are stochastically
independent is an immediate consequence of the above result. ►


9.3 LIKELIHOOD FUNCTIONS FOR FULL 
EXPONENTIAL FAMILIES 

The present section is devoted to the study of likelihood functions for full 
exponential families, in particular the question of existence and uniqueness of 
maximum likelihood estimates. Nearly all such families met in statistics are 
regular, or at least steep. The most important part of the section is Corollary 9.6 
which summarizes the main results for regular and steep families. 

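The log-likelihood function

(1) $l(\theta) = \theta \cdot t - \kappa(\theta)$

is, for any $t \in R^k$, a closed concave function on $R^k$, and

$$\frac{\partial^2 l}{\partial\theta\, \partial\theta'}(\theta) = -i(\theta) \qquad (\theta \in \mathrm{int}\, \Theta)$$

where $i(\theta)$ denotes Fisher’s information function.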

For a fixed t, consider the level sets of $l(\cdot)$, i.e. the sets

$$C_d = \{\theta: l(\theta) \ge d\}, \qquad d \in R.$$

By Theorem 5.14, the non-empty sets among the $C_d$, $d \in R$, are closed and convex,
and they all have the same recession cone. Hence they are either all unbounded or
all bounded, and the boundedness case occurs if and only if $t \in$ int C. To obtain the
latter assertion, note that

and invoke Theorems 9.10 and 5.20. 




It is convenient to take the maximum likelihood estimator $\hat{\theta}$ to be the function
defined for every $t \in R^k$ as the set of points $\theta$ which maximize $l$, rather than the
restriction of this function to the support S of $\mathfrak{P}$. Note that with this general
definition, $\hat{\theta}$ is the solution to the maximum likelihood estimation problem not
only for $\mathfrak{P}$ but for every family obtained from $\mathfrak{P}$ as in Section 8.2(ii).

Theorem 9.13. The log-likelihood function $l$ has a maximum if and only if $t \in$ int C, and
then the maximum is unique.

The maximum likelihood estimator $\hat{\theta}$ equals $D\hat{l}$, the gradient mapping of $\hat{l}$, and
the inverse mapping $\hat{\theta}^{-1}$ equals $\partial\kappa$.

Proof. Let $\theta \in R^k$, $t \in R^k$. From the remark following immediately after equation (2)
of Section 5.4 it is seen that $l(\cdot) = l(\cdot\,; t)$ attains its supremum at $\theta$ if and only if
$\theta \in \partial\hat{l}(t)$. But $\hat{l}$ is essentially smooth and thus $\partial\hat{l}$ equals the gradient mapping
$D\hat{l}$ of $\hat{l}$ which is single-valued and has domain int(dom $\hat{l}$) = int C, cf. Theorems
9.1 and 5.28. ►
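For a one-dimensional family with finite support the content of Theorem 9.13 can be made concrete: the maximum exists exactly when t lies strictly between the smallest and largest support points, and it is then the root of $\tau(\theta) = t$. The sketch below is an illustration only; the support, the base weights and the use of bisection are assumptions of the example, not part of the theorem.

```python
from math import exp

# Assumed finite support S and generating measure P_0.
support = [0, 1, 2, 3]
base = [0.1, 0.4, 0.3, 0.2]


def tau(theta):
    """Mean value of t under P_theta, i.e. the derivative of kappa at theta."""
    tilted = [w * exp(theta * s) for s, w in zip(support, base)]
    return sum(s * w for s, w in zip(support, tilted)) / sum(tilted)


def mle(t, lo=-50.0, hi=50.0, iterations=200):
    """Solve tau(theta) = t by bisection; None when t is not in int C."""
    if not (min(support) < t < max(support)):
        return None  # t on or outside the boundary of C: no maximum exists
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if tau(mid) < t:  # tau is strictly increasing in theta
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Bisection is used here purely for robustness; since $\tau$ is strictly increasing, any monotone root finder would serve. For boundary values such as `mle(3)` the function returns `None`, corresponding to the supremum of the likelihood being approached only as $\theta \to \infty$.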

Example 9.8. Let $x_1, \ldots, x_n$ be a sample of gamma-distributed variates. Then

$$t = \frac{1}{n} \sum_{i=1}^{n} (x_i, \ln x_i)$$

is a minimal canonical statistic and the convex support of the distribution of t is as
shown in Figure 9.2. Clearly, $t \in$ bd C for n = 1 while for n > 1 one has $t \in$ int C
with probability 1, i.e. for samples of size 2 or more the maximum likelihood
estimate of the parameter $(\lambda, \beta)$ of the gamma-distribution exists with probability
1. ►
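The existence condition in the gamma example amounts to a strict Jensen inequality. The sketch below assumes that the convex support is $C = \{t: t_1 > 0,\ t_2 \le \ln t_1\}$, which follows from the concavity of the logarithm (Figure 9.2 itself is not reproduced here), so that $t \in$ int C exactly when the mean of the logarithms is strictly below the logarithm of the mean; the tolerance and function names are choices made for the illustration.

```python
from math import log


def canonical_statistic(sample):
    """t = (mean of x_i, mean of ln x_i) for a sample of positive numbers."""
    n = len(sample)
    return (sum(sample) / n, sum(log(x) for x in sample) / n)


def is_interior(sample, tol=1e-12):
    """True when t lies in int C, assuming C = {t : t1 > 0, t2 <= ln t1}."""
    t1, t2 = canonical_statistic(sample)
    return t2 < log(t1) - tol
```

A single observation, or a sample whose values all coincide, gives a point on the boundary curve $t_2 = \ln t_1$; any sample with at least two distinct values gives an interior point, and under a continuous distribution coincident values have probability zero.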

Corollary 9.4. It is a necessary and sufficient condition for $\hat{\theta}$ to be defined with
probability one that bd C, the boundary of C, has probability zero.

Corollary 9.5. Let $\mathfrak{P}_0$ be an affine hypothesis of $\mathfrak{P}$ and consider a $t \in R^k$.

If the maximum likelihood estimate exists under $\mathfrak{P}$ (i.e. if $t \in$ int C) then it also
exists under $\mathfrak{P}_0$.

Proof. Since $\mathfrak{P}$ is full so is $\mathfrak{P}_0$, and if $t_0$ is a minimal canonical statistic for $\mathfrak{P}_0$
then $t_0$ is an affine function of t. Now apply Lemma 5.1. ►

Example 9.9. Suppose $\mathfrak{P}_0$ is an affine hypothesis of $\mathfrak{N}_r$, of the form
$\{N_{(\xi, \Delta)}: (\xi, \Delta) \in \Xi \times \Lambda\}$ with $0 \in \Xi$. (Such hypotheses were considered in
Examples 8.11 and 8.12.)
For simplicity, assume moreover that $\Lambda$ contains the identity matrix. Let $\hat{\theta}_0$
denote the maximum likelihood estimator under the model $\mathfrak{P}_0$. It seems rather
tempting to conjecture that dom $\hat{\theta}_0$ has probability either 1 or 0.

That this is not the case is shown by the example r = 3, $\Xi = \{0\}$, and $\Lambda$ equal
to the set of positive definite 3 × 3 matrices $\Delta$ whose diagonal elements $\delta_{ii}$ satisfy
$\delta_{11} + \delta_{22} - \delta_{33} = 1$. This example is due to Erlandsen (1975).


However, if $\Lambda$ is a cone and $\Sigma = \Lambda$, where $\Sigma$ is the set of variance matrices
corresponding to $\Lambda$, then the conjecture is true. The latter requirements are
equivalent to $\Sigma$ being a cone and affine. See Erlandsen (1975), and also Jensen
(1975). The paper by Jensen is a deep and comprehensive study of the structure of
this kind of model (with $\Xi = \{0\}$) and of the distributions of the maximum
likelihood estimators and likelihood ratio testors for such models. Of particular
importance among these models are those for which $\mathfrak{P}_0$ is, in a suitable
coordinate system for the observation vector, describable as a family of
distributions of a set of independent normal vectors such that the set may be
partitioned into subsets within each of which the vectors are identically
distributed, while, otherwise, $\Delta$ varies freely. Most of the models which have
practical interest are of this latter type. ►

It is immediate from (1) that the likelihood equation is

$E_\theta t = t$ $(\theta \in \mathrm{int}\, \Theta)$.

If this has a solution, i.e. if $t \in \mathcal{T}$, then the solution is unique and, by (2) of Section
9.1, equal to $\hat{\theta}(t)$. In the case $\mathcal{T}$ = int C (which is equivalent to $\kappa$ being steep) the
maximum likelihood estimate can therefore, whenever it exists, be found as the
(unique) solution to the likelihood equation.

Theorem 9.14. The maximum likelihood estimator $\hat{\theta}$ is a one-to-one function if and
only if $\kappa$ is steep.

Proof. $\hat{\theta}$ is one-to-one if and only if $\partial\kappa$ is single-valued and, on account of
Theorem 5.28, this means steepness of $\kappa$. ►

Theorem 9.14 and Corollary 9.4 imply that if $\kappa$ is steep and bd C has
probability zero then $\hat{\theta}$ is sufficient.

On the other hand, suppose $\theta$ is a boundary point of $\Theta$ and that $\kappa$ is not steep at
$\theta$. Then, by Theorem 5.26, $\partial\kappa(\theta)$ ($\subset$ int C) contains a non-empty cone, and $\hat{\theta}$ is
constant on $\partial\kappa(\theta)$. For k = 1, $\partial\kappa(\theta)$ is a halfline and must contain infinitely many
support points, so $\hat{\theta}$ is not sufficient.

Example 9.10. Let $\mathfrak{X} = R$, let $\chi > 1$ be a constant and let $P_0$ be the continuous-type
probability measure on R with density

$$\chi\, x^{-(\chi + 1)}, \qquad x \in (1, \infty)$$

(i.e. $P_0$ is the so-called Pareto distribution with support $(1, \infty)$ and shape
parameter $\chi$). As noted by Chentsov (1966), the exponential family generated by
$P_0$ and t, where t(x) = x, has

$$\Theta = (-\infty, 0]$$
so that the maximum likelihood estimation problem is not, as it were, solved by
writing down the likelihood equation.

In this case the cumulant transform $\kappa$ is strictly increasing and convex on
$(-\infty, 0]$ with $\kappa(\theta) \downarrow -\infty$ and $\kappa'(\theta) \downarrow 1$ for $\theta \downarrow -\infty$, $\kappa(0) = 0$ and
$\kappa'(0) = 1 + (\chi - 1)^{-1}$, while $\kappa(\theta) = \infty$ for $\theta > 0$. Clearly, $\kappa$ is not steep (and $\mathfrak{P}$ is not
regular). The conjugate $\kappa^* = \hat{l}$ of $\kappa$ is $\infty$ on $(-\infty, 1]$, strictly decreasing and
convex on $(1, 1 + (\chi - 1)^{-1}]$ with $\hat{l}(t) \uparrow \infty$ for $t \downarrow 1$, $\hat{l}(1 + (\chi - 1)^{-1}) = 0$,
$\hat{l}'(1 + (\chi - 1)^{-1}) = 0$ and $\hat{l}(t) = 0$ for $t > 1 + (\chi - 1)^{-1}$. Figures 9.3 and 9.4 show
the graphs of $\kappa$ and $\hat{l}$.

[Figure 9.3 and Figure 9.4: graphs of $\kappa$ and $\hat{l}$]

The maximum likelihood estimator $\hat{\theta}$ is defined with probability one but is not
sufficient; indeed $\hat{\theta}(t) = 0$ for $t \in [1 + (\chi - 1)^{-1}, \infty)$. ►

It is clear from the above discussion that steepness of $\kappa$ is a very essential
property of a full exponential family. Notice however that, in the case $\kappa$ is steep,
boundary points of $\Theta$ which belong to $\Theta$ do not occur among the values of $\hat{\theta}$ and
are in this sense superfluous. But there are no such boundary points if $\Theta$ is open,
i.e. if $\mathfrak{P}$ is regular. The main conclusions about steep exponential families, which
are contained in the previous discussion, may be summarized as follows.

Corollary 9.6. Suppose $\mathfrak{P}$ is steep, which is true in particular if $\mathfrak{P}$ is regular. The
maximum likelihood estimate exists if and only if $t \in$ int C, and then it is unique.
Furthermore, $\mathcal{T}$ = int C and the maximum likelihood estimator $\hat{\theta}$ is the one-to-one
mapping on $\mathcal{T}$ and onto int $\Theta$ whose inverse is $\tau$ (where $\tau(\theta) = E_\theta t$).

Example 9.11. Let $x_1, \ldots, x_n$ be a sample from the $N_r(\xi, \Sigma)$-distribution. By
Corollary 9.6, the maximum likelihood estimate of $(\xi, \Sigma)$ exists if and only if
$t \in$ int C, which happens precisely when $\sum (x_i - \bar{x})'(x_i - \bar{x})$ is positive
definite. (It is straightforward to see that the latter condition is satisfied with
probability 0 or 1 according as $n \le r$ or $n > r$.) ►
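The positive-definiteness criterion can be illustrated in the bivariate case r = 2. The sketch below is a toy check under stated assumptions: observations are given as pairs, the scatter matrix is the sum of outer products of deviations from the sample mean, and positive definiteness of a symmetric 2 × 2 matrix is tested through its leading principal minors; the function names and tolerance are illustrative.

```python
def scatter_matrix(sample):
    """Sum over the sample of (x_i - mean)'(x_i - mean) for 2-dimensional x_i."""
    n = len(sample)
    mean = [sum(x[j] for x in sample) / n for j in range(2)]
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in sample:
        d = [x[0] - mean[0], x[1] - mean[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def is_positive_definite_2x2(s, tol=1e-12):
    """Sylvester's criterion for a symmetric 2 x 2 matrix."""
    determinant = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return s[0][0] > tol and determinant > tol
```

With n = 2 observations the deviations from the mean are collinear, so the scatter matrix is singular whatever the data; with n = 3 observations in general position it is positive definite, in line with the dichotomy $n \le r$ versus $n > r$.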

From Corollary 9.6 and the final remark in Section 8.2(i) one finds: 

Corollary 9.7. Suppose $\mathfrak{P}$ is regular, let $\mathfrak{P}_0$ be a linear subfamily of $\mathfrak{P}$, i.e.
$\mathfrak{P}_0 = \{P_\theta: \theta \in \Theta_0\}$ where $\Theta_0 = \Theta \cap L$ and L is a linear subspace of $R^k$, and let P
denote the projection onto L. Then, under the model $\mathfrak{P}_0$, the maximum likelihood
estimate $\hat{\theta}_0(t)$ exists (and is unique) if and only if the likelihood equation $Pt = P\tau(\theta)$
has a solution in $\Theta_0$. ►

Example 9.12. Let $\mathfrak{P}$ be the model for an m-dimensional contingency table t of
independent Poisson variates whose mean values vary freely, and let $\mathfrak{P}_0$ be a
hierarchic submodel of $\mathfrak{P}$. Here Pt stands in one-to-one, affine correspondence
with the, so-called, minimal set of fitted marginals, and thus the maximum
likelihood estimate exists if and only if these estimates can indeed be fitted. (For
details, see Andersen 1974.) ►

Example 9.13. Suppose x is r-dimensional and normally distributed and let be 
an affine hypothesis of 9^, of the form = {^(o.a)- A e A} and such that A is a 
cone and E = A, cf. Example 9.9. The likelihood equation under may be 
written Px'x = PE, but the relation S = A implies PE = E and hence 

E = Px'x. >- 

It is of some interest to explore the situation when bd $C$ has positive probability 
(cf. Corollary 9.4). Thus, one may ask whether it is possible to enlarge the family $\mathfrak{P}$ 
in a natural way such that the maximum likelihood estimator becomes defined 
with probability one. The discussion of this topic will be confined to the case 
where the support $S$ of $\mathfrak{P}$ consists of finitely many points. This case is perhaps the 
most important from a practical point of view; at the same time it is fairly simple 
and admits of a complete solution. 

The mean value parametrization of $\mathfrak{P}$ establishes a one-to-one correspondence 
between int $C$ and $\mathfrak{P}$ and it follows from Theorem 8.3 that this correspondence is 
a homeomorphism ($\mathfrak{P}$ being endowed with the weak topology). Let $\psi$ denote the 
mapping $\tau \to P_\tau$. It will now be shown, in a constructive way, that there exists a 
family of probability measures $\bar{\mathfrak{P}}$ on $R^k$ and a mapping $\bar\psi$ on $C$ and onto $\bar{\mathfrak{P}}$ such 
that $\bar{\mathfrak{P}} \supset \mathfrak{P}$, $\bar\psi$ is an extension of $\psi$ and $\bar\psi$ is a homeomorphism. Clearly, $\bar\psi$ and $\bar{\mathfrak{P}}$ 
are uniquely determined by these properties. 

Let $F$ be a proper face of $C$. Then $F = \operatorname{conv} S_F$ where $S_F$ is the set of those points 
of $S$ which belong to $F$ (Theorem 5.8). Thus, in particular, $P_0(F) > 0$ and the 
conditional distribution $P_0(\cdot|F)$ is well-defined. The exponential family $\mathfrak{P}_F$ 
generated by $P_0(\cdot|F)$ and the identity mapping $t$ on $R^k$ has support $S_F$ and convex 
support $F$. Moreover, $\mathfrak{P}_F$ is regular and hence the set of mean values of $t$ under 
the probability measures in $\mathfrak{P}_F$ is equal to ri $F$ and $\mathfrak{P}_F$ is parametrized 
(homeomorphically) by the mean values. Note that 

$\mathfrak{P}_F = \{P(\cdot|F): P \in \mathfrak{P}\}$ 

(cf. Section 8.2(v)). 

Now, define $\bar\psi$ as the mapping on $C$ into the set of probability measures on $R^k$ 
which coincides with $\psi$ on int $C$ and which to any point $\tau$ in bd $C$ lets correspond 
the element of $\mathfrak{P}_F$ having mean value $\tau$, $F$ being the uniquely determined face of $C$ 
with $\tau \in \operatorname{ri} F$. The entities $\bar\psi$ and $\bar{\mathfrak{P}} = \bar\psi(C)$ will be called the completions of $\psi$ and 
$\mathfrak{P}$, respectively. 

Theorem 9.15. Suppose that the support $S$ of $\mathfrak{P}$ consists of only finitely many points. 
Let $\psi$ denote the mean value parametrization of $\mathfrak{P}$ and let $\bar\psi$ and $\bar{\mathfrak{P}}$ be the 
completions of $\psi$ and $\mathfrak{P}$. 

Then $\bar\psi$ is a homeomorphism on $C$ onto $\bar{\mathfrak{P}}$ ($\bar{\mathfrak{P}}$ being endowed with the weak 
topology). 

Proof. $\bar\psi$ is clearly one-to-one and the restriction $\psi$ of $\bar\psi$ to int $C$ is a 
homeomorphism. Therefore it suffices to show that if $\tau_0 \in \operatorname{bd} C$ and if 
$\tau_1, \ldots, \tau_n, \ldots$ is a sequence of points in int $C$ such that $\tau_n \to \tau_0$ then $\psi(\tau_n) \to \bar\psi(\tau_0)$. 
This, namely, implies that $\bar\psi$ is continuous which in turn implies continuity of $\bar\psi^{-1}$ 
since $C$ is compact. 

Let $\theta_n$ be the value of the canonical parameter for $\mathfrak{P}$ which corresponds to $\tau_n$ 
($n = 1, 2, \ldots$) and let $F$ be the uniquely determined proper face of $C$ which contains 
$\tau_0$ in its relative interior. Using $\tau_n \to \tau_0$ and the finiteness of $S$ it is simple to see 
that $P_{\theta_n}(F) \to 1$ and hence, for any bounded continuous function $f$ on $R^k$, 

(2) $\int f\,dP_{\theta_n} = \int f\,dP_{\theta_n}(\cdot|F) + o(1)$ 

(as $n \to \infty$). In particular, therefore, the mean value of $P_{\theta_n}(\cdot|F)$ converges to $\tau_0$ and, 
in view of Theorem 8.3, this implies that $P_{\theta_n}(\cdot|F)$, which is an element of $\mathfrak{P}_F$, 
converges (weakly) to the member of $\mathfrak{P}_F$ having mean value $\tau_0$, i.e. to $\bar\psi(\tau_0)$. 
Formula (2) shows that $P_{\theta_n} = \psi(\tau_n)$ must too converge to $\bar\psi(\tau_0)$. ► 

$\{P_\tau: \tau \in C\}$, where $P_\tau = \bar\psi(\tau)$, will be called the mean value parametrization of 
$\bar{\mathfrak{P}}$. The likelihood function corresponding to this parametrization and to an 
observation $t$, where $t$ may be any point in $C$, is the function with domain $C$ and 
whose value at $\tau \in C$ is 

(3) $\dfrac{dP_\tau}{dP_0}(t) = \dfrac{dP_0(\cdot|F)}{dP_0}(t)\,\dfrac{e^{\theta\cdot t}}{\int e^{\theta\cdot t}\,dP_0(\cdot|F)}.$ 

Here $F$ is the uniquely determined face of $C$ such that $\tau \in \operatorname{ri} F$ while $\theta$ is a point in 
$R^k$ such that the probability measure determined by the right-hand side has mean 
value $\tau$. The value (real) of the quantity on the right-hand side is the same for all 
such points $\theta$. 

Theorem 9.16. Suppose $S$ is finite and let $\bar{\mathfrak{P}} = \{P_\tau: \tau \in C\}$ be the mean value 
parametrization of the completion $\bar{\mathfrak{P}}$ of $\mathfrak{P}$. 

For each $t \in S$ the likelihood function (3) is a continuous function of $\tau \in C$ and 
attains its maximum at exactly one point, namely $\tau = t$. 

Proof. The continuity of the likelihood function follows from the continuity of $\bar\psi$ 
and the fact that $S$ is finite. 

If $\tau \in C$ is such that the face $F$ determined by $\tau$ does not contain $t$ then the value 
of the likelihood function is 0 and the maximum, which exists since $C$ is compact, 
is not attained at $\tau$. Also, the maximum is not attained at $\tau$ if $t \in \operatorname{rbd} F$ since this 
would imply that the likelihood function of the family $\mathfrak{P}_F$ assumed its supremum 
which is impossible by Theorem 9.13. Thus the maximum point is to be sought 
among those $\tau$ for which $t \in \operatorname{ri} F$ or, equivalently, among the relatively interior 
points of the face $F_0$ determined by $t$; i.e. one is faced with the maximum 
likelihood estimation problem for the family $\mathfrak{P}_{F_0}$ the solution of which is $\tau = t$. 

► 
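
To make Theorem 9.16 concrete, the following sketch (an illustration added here, with hypothetical function names) uses the binomial family with n trials, for which $S = \{0, 1, \ldots, n\}$ and $C = [0, n]$. Its completion adds the one-point measures at 0 and n, and in the mean value parametrization the probability of $t$ under $P_\tau$ is $\binom{n}{t}(\tau/n)^t(1-\tau/n)^{n-t}$, read with $0^0 = 1$. A grid search then suggests that the likelihood is maximized at $\tau = t$ for every $t \in S$, boundary points included.

```python
from math import comb

def completed_likelihood(tau, t, n):
    """Probability of t under the member of the completed binomial family with mean tau."""
    p = tau / n
    # Python evaluates 0.0 ** 0 as 1.0, which yields the one-point measures at tau = 0 and tau = n.
    return comb(n, t) * p ** t * (1 - p) ** (n - t)

def argmax_on_grid(t, n, steps=1000):
    """Grid point in C = [0, n] at which the likelihood for the observation t is largest."""
    grid = [n * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda tau: completed_likelihood(tau, t, n))

n = 5
for t in range(n + 1):
    print(t, argmax_on_grid(t, n))
```

The grid is only a numerical stand-in for the maximization over all of $C$; it happens to contain every point of $S$ for the values chosen here.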

The question of maximum likelihood estimation for an observation $t$ on the 
boundary of $C$ may also be looked at from a somewhat different angle. Although 
the maximum likelihood estimate of $\theta$ does not exist it is still possible that certain 
subparameters have a maximum likelihood estimate according to the definition 
given in Section 4.7(i), and, in fact, if $d$ is the dimension of the proper face $F$ of $C$ for 
which $t \in \operatorname{ri} F$ then a $(k - d)$-dimensional affine transformation of $\theta$ is estimable. 
This follows from the above discussion. To illustrate, let t^^^) and {9^^\ 9^^^) be 
similar partitions of t and 9 into components of dimensions d and k — d. and 
suppose that the face F may be expressed as F = {teC: t^^^ = 0}, which can 
always be obtained by suitable choice of the exponential representation of 
Then the maximum likelihood estimate of 9^^^ exists and is the unique solution of 


(cf. formula 3). 

Example 9.14. Logistic dose-(binomial) response model. In the statistical theory for 
analysis of bio-assays with quantal response a prominent role is played by the 
logistic response model. Data corresponding to this model consist of a set of real 
numbers $x_1 < x_2 < \cdots < x_d$, which normally are the logarithms of the various 
doses applied in the experiment, and to each $x_i$ is associated a pair of integers 
$(n_i, a_i)$ where $n_i$ ($> 0$) is the number of trials performed with the $i$th dose while $a_i$ is 
the number of positive outcomes among the $n_i$ (in the present context death is 
often the positive outcome). The model describes the $a_i$, $i = 1, \ldots, d$, as 
independent observations, $a_i$ following the binomial distribution with numbering 
parameter $n_i$ and probability parameter 

$p_i = \dfrac{1}{1 + e^{-(\alpha + \beta x_i)}}$ 

where the parameters $\alpha$ and $\beta$ vary freely in $R$. (The graph of the function 

$x \to \dfrac{1}{1 + e^{-(\alpha + \beta x)}}, \quad x \in R,$ 

is the logistic curve.) Hence, the probability of observing $a_1, \ldots, a_d$ is 

$\prod_{i=1}^{d} \binom{n_i}{a_i} (1 + e^{\alpha + \beta x_i})^{-n_i}\, e^{\alpha s + \beta w}$ 

where 

(4) $s = \sum_{i=1}^{d} a_i, \quad w = \sum_{i=1}^{d} x_i a_i.$ 

The model is thus regular exponential of order 2 with $\Theta = R^2$ and 

$C = \operatorname{cl\,conv}\{(s, w): s = \sum_{i=1}^{d} a_i,\ w = \sum_{i=1}^{d} x_i a_i,\ 0 \le a_i \le n_i,\ i = 1, \ldots, d\}.$ 

Let $\mathfrak{P}$ denote the family of distributions of $(s, w)$. 

From the results developed above it follows that the problem of estimating 
$(\alpha, \beta)$ on the basis of observations $a_1, \ldots, a_d$ is solvable if and only if 
$(s, w) \in \mathfrak{T} = \operatorname{int} C$ and that the estimate may be obtained as the unique solution to 
the likelihood equations 

$\sum_i n_i p_i = s, \quad \sum_i x_i n_i p_i = w$ 

(which have to be solved by numerical iteration). It follows furthermore that for 
the completion $\bar{\mathfrak{P}}$ endowed with the mean value parametrization the estimation 
problem is solvable for every $(s, w) \in C$. 
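
The numerical iteration mentioned above can be sketched with Newton's method. This is only one possible scheme and is not prescribed by the text; it assumes the response probability is written as $p_i = 1/(1 + e^{-(\alpha + \beta x_i)})$, that $(s, w)$ lies in int $C$, and the name `fit_logistic` is hypothetical.

```python
import math

def fit_logistic(x, n, a, iterations=50, tol=1e-12):
    """Newton iteration for sum n_i p_i = s and sum x_i n_i p_i = w (assumes (s, w) in int C)."""
    s = sum(a)
    w = sum(xi * ai for xi, ai in zip(x, a))
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(alpha + beta * xi))) for xi in x]
        # Score: observed minus expected value of the canonical statistic (s, w).
        g0 = s - sum(ni * pi for ni, pi in zip(n, p))
        g1 = w - sum(xi * ni * pi for xi, ni, pi in zip(x, n, p))
        # Information matrix: the covariance matrix of (s, w).
        v = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]
        h00 = sum(v)
        h01 = sum(xi * vi for xi, vi in zip(x, v))
        h11 = sum(xi * xi * vi for xi, vi in zip(x, v))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        alpha, beta = alpha + da, beta + db
        if abs(da) + abs(db) < tol:
            break
    return alpha, beta

alpha_hat, beta_hat = fit_logistic(x=[-1, 0, 1], n=[10, 10, 10], a=[2, 5, 8])
print(alpha_hat, beta_hat)
```

For these illustrative data $s = 15$ and $w = 6$, and the likelihood equations can be solved by hand: $\alpha = 0$ and $\beta = \ln 4$. For an observation on bd $C$ the iteration would not converge, since no estimate exists there.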

Consider more closely the simplest case, namely the one for which $d = 3$, 
$x_1 = -1$, $x_2 = 0$, $x_3 = 1$ and $n_1 = n_2 = n_3 = n$. Here 

$s = a_1 + a_2 + a_3, \quad w = a_3 - a_1$ 

and the convex support $C$ has appearance as shown in Figure 9.5. 

Let $(\sigma, \omega)$ denote the mean value parameter for $\bar{\mathfrak{P}}$ so that $\bar{\mathfrak{P}}$ may be written 
$\{P_{(\sigma,\omega)}: (\sigma, \omega) \in C\}$. If, for instance, $(\sigma, \omega)$ is a boundary point of $C$ of the form 

$(\sigma, \omega) = (\lambda, \lambda)$ 

where $\lambda \in (0, n)$ then, by definition, 

(5) $P_{(\lambda,\lambda)}\{(s, w)\} = \binom{n}{s}\left(\dfrac{\lambda}{n}\right)^{s}\left(1 - \dfrac{\lambda}{n}\right)^{n-s}$ for $w = s$, $s = 0, 1, \ldots, n$, 

while the probability is 0 when $w \ne s$. To show the validity of this formula, note 
first that the face $F$ determined by the point $(\lambda, \lambda)$ is given by $F = C \cap 
\{(s, w): s = w\}$ and that $s = w$ implies $a_1 = a_2 = 0$, $a_3 = s$. The mean value 
parameter corresponding to $\alpha = 0$, $\beta = 0$ is $(\sigma_0, \omega_0) = (3n/2, 0)$ and so 

$P_{(\sigma_0,\omega_0)}\{s = w \mid F\} = \binom{n}{s} 2^{-n}$ 

whence (5) follows. The elements of $\bar{\mathfrak{P}}$ corresponding to the extreme points of $C$ 
are the one-point probability measures at these points. (Silverstone (1957) 
discussed another kind of compactification of this model.) 

Note furthermore that if $0 < s = w < n$, for example, then $\alpha + \beta$ is estimable by 
the method of maximum likelihood, even though $(\alpha, \beta)$ is not. In the case 
$n < s < 2n$, the parameter $\alpha$ is estimable irrespectively of the value of $w$. Thus, for 
$n = 2$, $a_1 = 0$, $a_2 = 1$, $a_3 = 2$ one has $\hat\alpha = 0$, while $\hat\beta$ does not exist. ► 


9.4 LIKELIHOOD FUNCTIONS FOR CONVEX 
EXPONENTIAL FAMILIES^ 

In the present section $\mathfrak{P}_0$ denotes a convex subfamily of the full exponential 
family $\mathfrak{P}$ and $\Theta_0$ stands for the (convex) subset of $\Theta$ corresponding to $\mathfrak{P}_0$. Unless 
explicitly stated otherwise, $\mathfrak{P}_0$ is assumed to be of order $k$ which is equivalent to 
$\operatorname{int}\Theta_0 \ne \emptyset$. Full exponential families are convex and results obtained here 
generalize some of those given in Sections 9.1 and 9.3. 

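It will be convenient, and will cause no loss of generality, to think of $\Theta_0$ as given 
by a relation of the form $\Theta_0 = \Theta \cap D$ where $D$ is a convex subset of $R^k$ (necessarily 
of dimension $k$). Set $\mathfrak{T}_0 = \tau(\Theta_0 \cap \operatorname{int}\Theta)$ $(= \tau(D \cap \operatorname{int}\Theta))$ and, for $t \in R^k$, 

$l_0(\theta) = l_0(\theta; t) = \theta\cdot t - \kappa(\theta) - \delta(\theta|D), \quad \theta \in R^k,$ 
and 

$\hat{l}_0(t) = \sup_\theta l_0(\theta).$ 


^ This section treats a rather special topic. The results discussed are not used elsewhere in 
the book. 

The function $l_0$ is the log-likelihood function of $\mathfrak{P}_0$. When necessary, the 
dependence of $\hat{l}_0$ on $D$ will be indicated by writing $\hat{l}_0(\cdot|D)$. Since 

(1) $\operatorname{int}\Theta_0 = (\operatorname{int}\Theta) \cap (\operatorname{int} D)$ 
one has 

$\hat{l}_0(t|D) = \sup\{\theta\cdot t - \kappa(\theta): \theta \in \Theta_0\}$ 

$= \sup\{\theta\cdot t - \kappa(\theta): \theta \in (\operatorname{int}\Theta) \cap (\operatorname{int} D)\}$ 

and consequently 

(2) $\hat{l}_0(\cdot|D) = \hat{l}_0(\cdot|\operatorname{cl} D).$ 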

The following theorem is analogous to Theorem 9.1 and is similarly important 
in discussing maximum likelihood estimation. 

Theorem 9.17. One has 

(i) $(\kappa + \delta(\cdot|D))^* = \hat{l}_0$ and $\hat{l}_0^* = \kappa + \delta(\cdot|\operatorname{cl} D)$. 

(ii) $\kappa + \delta(\cdot|D)$ is a strictly convex function with $\operatorname{dom}(\kappa + \delta(\cdot|D)) = \Theta_0$, and this 
function is closed if $D$ is closed. 

(iii) $\hat{l}_0$ is a closed and essentially smooth convex function with $\operatorname{int} C \subset \operatorname{dom}\hat{l}_0$. 

Proof. Assertions (i) and (ii) are immediate. That $\hat{l}_0$ is essentially smooth follows 
from (ii), formula (2), and Theorem 5.30, and the inclusion $\operatorname{int} C \subset \operatorname{dom}\hat{l}_0$ is a 
consequence of the inequality $\hat{l}_0 \le \hat{l}$ and Theorem 9.1. ► 

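By formula (3) of Section 5.3 

$\hat{l}_0 = \hat{l} \mathbin{\square} \delta^*(\cdot|D)$ 

and hence 

(3) $\operatorname{dom}\hat{l}_0 = \operatorname{dom}\hat{l} + \operatorname{bar} D.$ 

Let $t \in R^k$ and consider the level sets of $l_0(\cdot; t)$ 

$C_{0d} = \{\theta: l_0(\theta; t) \ge d\}, \quad d \in R.$ 

If $D$ is closed a statement completely analogous to that made at the beginning of 
Section 9.3 holds for the collection $C_{0d}$, $d \in R$. Note also that, whether $D$ is closed 
or not 

$C_{0d} = D \cap C_d$ 

(where $C_d = \{\theta: l(\theta; t) \ge d\}$). 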

As in Section 9.3, the maximum likelihood estimator is taken to be defined for 
every $t \in R^k$. The maximum likelihood estimators under $\mathfrak{P}_0$ and $\mathfrak{P}$ will be denoted 
respectively by $\hat\theta_0$ and $\hat\theta$. 

For the next theorem, which generalizes Theorem 9.13, note that by Theorem 
5.24 and Example 5.6 one has 
(4) $\partial(\kappa + \delta(\cdot|D)) = \partial\kappa + K$ 
where $K$ is the normal cone mapping for $D$. 

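Theorem 9.18. The log-likelihood function $l_0$ has a maximum if and only if $t \in (\partial\kappa 
+ K)(\Theta_0)$, and then the maximum is unique. 

The mapping inverse to the maximum likelihood estimator $\hat\theta_0$ is $\partial\kappa + K$. 

If $D$ is closed then $\hat\theta_0$ equals $D\hat{l}_0$ and $\operatorname{dom}\hat\theta_0 = \operatorname{int} C + \operatorname{bar} D$. 

If $D$ is open then $\hat\theta_0$ equals the restriction of $D\hat{l}$ to $\partial\kappa(\Theta_0)$ $(\subset \operatorname{int} C)$. 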

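Proof. A point $\tilde\theta \in R^k$ is a maximum point for the log-likelihood function $l_0$ 
corresponding to $t \in R^k$ if and only if $t \in (\partial\kappa + K)(\tilde\theta)$, cf. formulas (2) and (4) of 
Section 5.4, and in this case $\tilde\theta \in \Theta_0$ since $l_0(\theta) = -\infty$ for $\theta \notin \Theta_0$. To see that the 
maximum is unique, note first that for every $\theta \in R^k$ 

(5) $\partial(\kappa + \delta(\cdot|D))(\theta) = \partial\kappa(\theta) + \partial\delta(\cdot|D)(\theta)$ 

$\subset \partial\kappa(\theta) + \partial\delta(\cdot|\operatorname{cl} D)(\theta)$ 

$= \partial(\kappa + \delta(\cdot|\operatorname{cl} D))(\theta).$ 

The assertion of uniqueness is equivalent to 

(6) $\partial(\kappa + \delta(\cdot|D))(\theta) \cap \partial(\kappa + \delta(\cdot|D))(\tilde\theta) = \emptyset$ 

for every pair $\theta$, $\tilde\theta$ with $\theta \ne \tilde\theta$. In view of (5) it suffices to verify (6) for $D$ closed, and 
on this assumption $\kappa + \delta(\cdot|D)$ is closed. Application of Theorem 5.29 now yields 
the result. 

Obviously, then, $\hat\theta_0^{-1} = \partial\kappa + K$. 

If $D$ is closed then $D\hat{l}_0$ is the inverse of $\partial(\kappa + \delta(\cdot|D)) = \hat\theta_0^{-1}$. Furthermore, by (3) 

$\operatorname{dom}\hat\theta_0 = \operatorname{int}\operatorname{dom}\hat{l}_0$ 

$= \operatorname{int} C + \operatorname{bar} D.$ 

For $D$ open one has $K = \partial\delta(\cdot|D) = 0$ on $D$ and hence $\hat\theta_0^{-1}$ is the restriction of $\partial\kappa$ to $D$ or, 
equivalently, to $\Theta_0$. But $\partial\kappa$ is the inverse of $D\hat{l}$ and therefore the last statement of 
the theorem is true. ► 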

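The likelihood equation is 

$E_\theta t = t \quad (\theta \in \Theta_0).$ 

If this has a solution $\tilde\theta$, i.e. if $t \in \mathfrak{T}_0$, then $\tilde\theta = \hat\theta_0(t)$, because 

$(\partial\kappa + K)(\tilde\theta) \ni \partial\kappa(\tilde\theta) = \tau(\tilde\theta).$ 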

In the case $\mathfrak{P}$ is regular Theorem 9.18 specializes to 

Corollary 9.8. Suppose $\mathfrak{P}$ is regular. The maximum likelihood estimate exists if and 
only if $t \in (\tau + K)(\Theta_0)$, and then it is unique. Furthermore, $\hat\theta_0^{-1} = \tau + K$ and range 
$\hat\theta_0 = \Theta_0$. 

If $D$ is closed then $\operatorname{dom}\hat\theta_0 = \operatorname{int} C + \operatorname{bar} D$. 

If $D$ is open then $\hat\theta_0$ equals the restriction of $\hat\theta$ to $\mathfrak{T}_0$. 

Consider the situation where $\mathfrak{P}$ is regular and $D$ has the simplest possible form, 
namely that of a half-space 

$D = \{\theta: \theta\cdot e \le d\}$ 

where $e$ denotes a unit vector. Furthermore, let $t \in \operatorname{dom}\hat\theta_0 \setminus \mathfrak{T}_0$. Then the 
maximum likelihood estimate $\tilde\theta = \hat\theta_0(t)$ belongs to $\Theta \cap \{\theta: \theta\cdot e = d\}$ and 

$t - \tau(\tilde\theta) \in \{\lambda e: \lambda > 0\}.$ 

$\tilde\theta$ can therefore be found as follows. First one projects $t$ in the direction $-e$ onto 
the boundary of $\mathfrak{T}_0$. This operation yields a uniquely determined point 
$\tilde\tau \in \operatorname{bd}\mathfrak{T}_0$. Next, $\tilde\theta$ is obtained as solution to the equation $\tau(\theta) = \tilde\tau$. 

Example 9.15. Logistic dose-(binomial) response model with $\beta > 0$ or $\beta \ge 0$. The 
terminology employed is that introduced in Example 9.14. The family $\mathfrak{P}$ is 
regular with $\Theta = R^2$ so that for any $D$, $\Theta_0 = D = \operatorname{range}\hat\theta_0$. Assume for simplicity 
that $d = 3$, $x_1 = -1$, $x_2 = 0$, $x_3 = 1$ and $n_1 = n_2 = n_3 = n$. In this case the line 

$\{(\alpha, \beta): \beta = 0\}$ in $\Theta$ is mapped by the mean value mapping onto 

$\{(s, w): 0 < s < 3n,\ w = 0\}.$ 

If $D = \{(\alpha, \beta): \beta > 0\}$ then, by Corollary 9.8, $\hat\theta_0$ is the restriction to $\mathfrak{T}_0$ of $\hat\theta$ and 
$\mathfrak{T}_0$ is the part of int $C$ contained in the open first quadrant, cf. Figure 9.5. 
If $D = \{(\alpha, \beta): \beta \ge 0\}$ then 

$\operatorname{dom}\hat\theta_0 = \mathfrak{T}_0 \cup \{(s, w): 0 < s < 3n,\ w \le 0\}$ 

where $\mathfrak{T}_0 = \{(s, w): (s, w) \in \operatorname{int} C,\ w > 0\}$, and if $(s, w)$ is a point of $(\operatorname{dom}\hat\theta_0) \cap C$ with 
$w < 0$ then $(s, w)$ gives rise to the same maximum likelihood estimate as does the 
point $(s, 0)$. ► 

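Example 9.16. Doubly truncated normal distribution. Let $a$ and $b$ be real numbers 
with $a < b$. The normal distribution $N(\xi, \sigma^2)$ truncated to the interval $(a, b)$ has 
density with respect to Lebesgue measure given by 

(7) $\left\{\Phi\!\left(\dfrac{b - \xi}{\sigma}\right) - \Phi\!\left(\dfrac{a - \xi}{\sigma}\right)\right\}^{-1} \sigma^{-1}\varphi\!\left(\dfrac{x - \xi}{\sigma}\right), \quad a < x < b,$ 

where $\varphi$ and $\Phi$ denote the standard normal density and distribution function. 

Let $x_1, x_2, \ldots, x_n$ ($n > 1$) be independent identically distributed random variates 
following the distribution (7) and consider the problem of estimating 

$(\xi, \sigma^2) \in R \times (0, \infty)$ on the basis of an observation $(x_1, \ldots, x_n)$. 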

This problem is, of course, equivalent to that of estimating 

$\theta = \left(\dfrac{\xi}{\sigma^2},\ 1 - \dfrac{1}{\sigma^2}\right) \in R \times (-\infty, 1)$ 

on the basis of the minimal sufficient statistic 

$t = \left(\sum_i x_i,\ \tfrac{1}{2}\sum_i x_i^2\right).$ 

Let $P_\theta$ denote the probability measure of the marginal distribution of $t$ and let $\mathfrak{P}_0 
= \{P_\theta: \theta \in \Theta_0\}$. One has 


where 


dPo 


46 )^^^ = {(^(b) - €>(«)} 






= m - myn - ;/( r r gj 


MlzizIzA 




X e 


The parametrization $\{P_\theta: \theta \in \Theta_0\}$ is minimal and may be extended, in unique 
manner, to a minimal parametrization $\{P_\theta: \theta \in \Theta\}$ of the canonical family $\mathfrak{P}$ 
generated by $\mathfrak{P}_0$. The convex support of $\mathfrak{P}$ is the set 

$C = \{t: na \le t_1 \le nb,\ \tfrac{1}{2}t_1^2 \le n t_2 \le \tfrac{1}{2}(na)^2 + \tfrac{1}{2}(t_1 - na)(na + nb)\}$ 

and, since $C$ is bounded, $\Theta = R^2$ and $\mathfrak{P}$ is therefore regular. 

Invoking Corollary 9.8 one may conclude that the maximum likelihood 
estimate of $(\xi, \sigma^2)$ exists if and only if $t$ is contained in the open subset $\mathfrak{T}_0 = \tau(\Theta_0)$ 
of int $C$. In order to get an impression of how $\mathfrak{T}_0$ looks, the mapping $\tau$ is studied 
next. 

For arbitrary 9e& 






r dx 

a 


and hence r is given by 


(8) 


t(e) = 


f 

a 




a 





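That part of the boundary of $\mathfrak{T}_0$ which lies in int $C$ is the image under $\tau$ of 
$\operatorname{bd}\Theta_0 = \{\theta: \theta_2 = 1\}$. For $\theta_2 = 1$ all three integrations in (8) can be performed and 
one obtains 

(9) $\tau_1(\theta_1, 1) = n\left\{\dfrac{b e^{b\theta_1} - a e^{a\theta_1}}{e^{b\theta_1} - e^{a\theta_1}} - \dfrac{1}{\theta_1}\right\}$ 

(10) $\tau_2(\theta_1, 1) = n\left\{\dfrac{1}{2}\,\dfrac{b^2 e^{b\theta_1} - a^2 e^{a\theta_1}}{e^{b\theta_1} - e^{a\theta_1}} - \dfrac{1}{\theta_1}\,\dfrac{b e^{b\theta_1} - a e^{a\theta_1}}{e^{b\theta_1} - e^{a\theta_1}} + \dfrac{1}{\theta_1^2}\right\}$ 

where for $\theta_1 = 0$ the two right hand sides should be interpreted as the limiting 
values for $\theta_1 \to 0$, i.e. 

$\tau_1(0, 1) = \tfrac{1}{2}n(a + b), \quad \tau_2(0, 1) = \tfrac{1}{6}n(a^2 + ab + b^2).$ 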

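The following limiting relations hold 

$\tau(\theta_1, 1) \to n(a, \tfrac{1}{2}a^2)$ as $\theta_1 \to -\infty$ 

$\tau(\theta_1, 1) \to n(b, \tfrac{1}{2}b^2)$ as $\theta_1 \to +\infty$ 

and, as is not difficult to see, $\mathfrak{T}_0$ is the region bounded by the two curves 

(11) $\{\tau(\theta_1, 1): -\infty < \theta_1 < +\infty\}$ 

(12) $\{t: n t_2 = \tfrac{1}{2}t_1^2,\ na < t_1 < nb\}.$ 

The curves are tangent to each other at the endpoints $n(a, \tfrac{1}{2}a^2)$ and $n(b, \tfrac{1}{2}b^2)$. To 
show this, form the difference quotient 

$\dfrac{n\tfrac{1}{2}b^2 - \tau_2(\theta_1, 1)}{nb - \tau_1(\theta_1, 1)}$ 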

which, in view of (9) and (10), may be written 

- 1} + 1) 

It is apparent from the latter expression that as $\theta_1 \to \infty$ the difference quotient 
tends to $b$ which is equal to the slope of the tangent to (12) at $n(b, \tfrac{1}{2}b^2)$. Figure 9.6 
indicates the appearance of the curves (11) and (12) as well as the sets $C$ and $\mathfrak{T}_0$. 

► 


Figure 9.6 

If bd $C$ has positive probability then $\hat\theta_0$ will not, in general, exist with 
probability one. However, by enlarging the family $\mathfrak{P}_0$ suitably, as was done for 
full families in Section 9.3, it is possible to remedy the matter. Specifically, 
suppose $S$ is finite and let $\bar{\mathfrak{P}}_0$ denote the closure of $\mathfrak{P}_0$ in the weak topology. Then 
$\bar{\mathfrak{P}}_0 \subset \bar{\mathfrak{P}}$, the completion of $\mathfrak{P}$, and under $\bar{\mathfrak{P}}_0$ the maximum likelihood estimate 
exists and is unique for every $t \in S$. Space will not be taken here to provide a proof 
of this fact. 

Theorem 9.19 below, and also Theorem 9.32 and Corollary 9.11 in Section 
9.8(ix), contain necessary and sufficient conditions for a pair $t, \tilde\theta \in R^k$ to satisfy 
$t \in \operatorname{dom}\hat\theta_0$ and $\tilde\theta = \hat\theta_0(t)$. These conditions are useful for determining whether a 
proposed $\tilde\theta$ is in fact the maximum likelihood estimate corresponding to $t$. 

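Theorem 9.19. Suppose $\kappa$ is steep and let $t \in R^k$, $\tilde\theta \in R^k$. Then $t \in \operatorname{dom}\hat\theta_0$ and 
$\tilde\theta = \hat\theta_0(t)$ if and only if $\tilde\theta \in D \cap \operatorname{int}\Theta$ and 

(13) $(\theta - \tilde\theta)\cdot(\tau(\tilde\theta) - t) \ge 0, \quad \theta \in \Theta_0.$ 

Proof. If $t \in \operatorname{dom}\hat\theta_0$ and $\tilde\theta = \hat\theta_0(t)$ then, by Theorem 9.17, $\tilde\theta \in D \cap \operatorname{int}\Theta$ and 

(14) $t \in \tau(\tilde\theta) + K(\tilde\theta).$ 

From the definition of a normal cone it follows that (14) is equivalent to 
$(\theta - \tilde\theta)\cdot(t - \tau(\tilde\theta)) \le 0, \quad \theta \in D,$ 
which clearly implies (13). 

On the other hand, suppose that $\tilde\theta \in D \cap \operatorname{int}\Theta$ and that (13) holds. Let $\theta$ be an 
arbitrary element of $D$. On account of the convexity of $D$ one has 
$\theta_\lambda = \tilde\theta + \lambda(\theta - \tilde\theta) \in D$ for every $\lambda \in (0, 1)$. If $\lambda$ is sufficiently small then $\theta_\lambda \in \operatorname{int}\Theta$ 
and hence $\theta_\lambda \in \Theta_0$ and 

$(\theta_\lambda - \tilde\theta)\cdot(\tau(\tilde\theta) - t) \ge 0.$ 

Thus 

$(\theta - \tilde\theta)\cdot(\tau(\tilde\theta) - t) \ge 0$ 

and one may conclude that $t \in \tau(\tilde\theta) + K(\tilde\theta)$ or, equivalently, $t \in \operatorname{dom}\hat\theta_0$ and 

$\tilde\theta = \hat\theta_0(t)$. ► 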
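
As a small worked illustration of how condition (13) can be used to test a proposed estimate (a sketch under stated assumptions, not part of the text), take the Poisson family for a single observation, with $\kappa(\theta) = e^{\theta}$, $\tau(\theta) = e^{\theta}$, $\Theta = R$, and let $D = (-\infty, d]$. The inequality is read here as $(\theta - \tilde\theta)(\tau(\tilde\theta) - t) \ge 0$ for $\theta \in \Theta_0$, and it is checked only on a finite grid truncated at an arbitrary lower bound; the function name and the grid are hypothetical choices.

```python
import math

def satisfies_13(theta_tilde, t, d, grid_size=2001, lower=-10.0, tol=1e-12):
    """Check theta_tilde in D and (theta - theta_tilde) * (tau(theta_tilde) - t) >= 0 on a grid in D."""
    if theta_tilde > d:
        return False
    tau_tilde = math.exp(theta_tilde)          # Poisson mean value mapping
    for i in range(grid_size):
        theta = lower + (d - lower) * i / (grid_size - 1)
        if (theta - theta_tilde) * (tau_tilde - t) < -tol:
            return False
    return True

d = 1.0
print(satisfies_13(1.0, 5.0, d))           # t above e^d: boundary point d passes
print(satisfies_13(0.5, 5.0, d))           # an interior candidate fails for t = 5
print(satisfies_13(math.log(2.0), 2.0, d)) # unconstrained estimate ln 2 lies in D and passes
```

In this one-dimensional case the check agrees with the closed form $\min(\ln t, d)$ for $t > 0$, which follows directly from the inequality.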


9.5 PROBABILITY FUNCTIONS FOR EXPONENTIAL FAMILIES 

It will be shown in this section that for exponential families the concepts of strong 
unimodality and universality are virtually equivalent. Moreover, certain mild 
regularity conditions will be specified which are sufficient to ensure that a 
strongly unimodal or universal family is strictly universal, and that unimodality 
of $\mathfrak{P}$ implies strong unimodality. 

The relation between unimodality and strong unimodality will be discussed 
first. 

Note that since the densities of $\mathfrak{P}$ are of the form 

$a(\theta)\,b(t)\,e^{\theta\cdot t}$ 

the family $\mathfrak{P}$ is strongly unimodal if just one element of $\mathfrak{P}$ is strongly unimodal. 
The analogous proposition for unimodality is not true, as is shown by the 
following example. 

Example 9.17. Let $P_0$ be the probability measure on $R$ having support $S = [0, 1]$ 
and density 

$b(t) = \tfrac{1}{2}t^{-1/2}, \quad t \in [0, 1],$ 

and let $\mathfrak{P}$ be the corresponding full exponential family. Then 
$\varphi(t) - \theta t = \tfrac{1}{2}\ln t - \theta t + \ln 2.$ 

For $\theta = 1$ this function is not quasiconvex on $S$ and hence $P_1$ is not unimodal. 

► 
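
The failure of quasiconvexity in Example 9.17 can be checked numerically. The sketch below assumes the density of the example is $b(t) = \tfrac{1}{2}t^{-1/2}$ on $(0, 1]$, so that $\varphi(t) - \theta t = \tfrac{1}{2}\ln t - \theta t + \ln 2$, and it tests the function only on a finite grid; both helper names are illustrative.

```python
import math

def phi_minus_theta_t(t, theta):
    """phi(t) - theta * t with phi = -ln b and b(t) = 1 / (2 sqrt(t)) on (0, 1]."""
    return 0.5 * math.log(t) - theta * t + math.log(2.0)

def is_quasiconvex_on_grid(values):
    """A finite sequence is quasiconvex iff it is non-increasing and then non-decreasing."""
    i = 0
    while i + 1 < len(values) and values[i + 1] <= values[i]:
        i += 1
    while i + 1 < len(values) and values[i + 1] >= values[i]:
        i += 1
    return i == len(values) - 1

grid = [i / 100 for i in range(1, 101)]
for theta in (0.0, 1.0):
    values = [phi_minus_theta_t(t, theta) for t in grid]
    print(theta, is_quasiconvex_on_grid(values))
```

For $\theta = 0$ the function is increasing on the grid, whereas for $\theta = 1$ it rises up to $t = \tfrac{1}{2}$ and falls afterwards, which is the shape that rules out unimodality of the corresponding density.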

However, one has 

Theorem 9.20. Suppose $\mathfrak{P}$ is full and $\Theta = R^k$. 

Then $\mathfrak{P}$ is unimodal if and only if it is strongly unimodal. 

Proof. The if assertion is trivial and for $\mathfrak{P}$ of continuous type the converse follows 
immediately from Theorem 5.12. 

In the discrete case, suppose $\mathfrak{P}$ is not strongly unimodal. Then for some $\theta_0 \in \Theta$, 
$P_{\theta_0}$ is not strongly unimodal. Without loss of generality it may be assumed that 
$\theta_0 = 0$. Let $\bar\varphi = \operatorname{conv}\varphi$. Clearly, then, there must exist a point $t_0 \in S$ for which 

$\varphi(t_0) > \bar\varphi(t_0)$ 

and this, by Theorem 5.16, implies the existence of $k + 1$ points $s_1, \ldots, s_{k+1}$ in $S$ 
and of $k + 1$ non-negative scalars $\lambda_1, \ldots, \lambda_{k+1}$ with $\lambda_1 + \cdots + \lambda_{k+1} = 1$ such that 

$t_0 = \lambda_1 s_1 + \cdots + \lambda_{k+1}s_{k+1}$ 

and 

(1) $\varphi(t_0) > \lambda_1\varphi(s_1) + \cdots + \lambda_{k+1}\varphi(s_{k+1}).$ 

Set 

$S_0 = \{t_0, s_1, \ldots, s_{k+1}\},$ 

let $\varphi_0$ be the function on $R^k$ which coincides with $\varphi$ on $S_0$ and is $+\infty$ elsewhere, 
and set $\bar\varphi_0 = \operatorname{conv}\varphi_0$. On account of (1) one has 

$\varphi_0(t_0) > \bar\varphi_0(t_0).$ 

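Let $L$ be a non-vertical supporting hyperplane to $\operatorname{epi}\bar\varphi_0$ at $(t_0, \bar\varphi_0(t_0))$. Such a 
hyperplane exists on account of Theorem 5.23 and it has the form 

$L = \{(t, \eta): t \in R^k,\ \eta \in R,\ -\omega\cdot t + \eta = \alpha\}$ 
for some $\omega \in R^k$, $\alpha \in R$. Set $F = L \cap \operatorname{epi}\bar\varphi_0$. $F$ is a face of $\operatorname{epi}\bar\varphi_0$ and 
$\operatorname{epi}\bar\varphi_0 = \operatorname{conv} S_0'$ where $S_0'$ consists of the points $(s, \varphi(s))$, $s \in S_0$, and the direction $(0, 1)$, 
cf. Theorem 5.17. Therefore (cf. Rockafellar 1970, Theorem 18.3) there exist 
points $t_1, \ldots, t_m$ in $S_0$ and non-negative scalars $\lambda_1, \ldots, \lambda_m$ with $\lambda_1 + \cdots + \lambda_m = 1$ 
such that 

$(t_i, \varphi_0(t_i)) \in F, \quad i = 1, \ldots, m,$ 

and 

$(t_0, \bar\varphi_0(t_0)) = \lambda_1(t_1, \varphi_0(t_1)) + \cdots + \lambda_m(t_m, \varphi_0(t_m)).$ 

It follows that 

$\varphi(t_0) - \omega\cdot t_0 > \alpha$ 

$\varphi(t_i) - \omega\cdot t_i = \alpha, \quad i = 1, \ldots, m$ 

$t_0 = \lambda_1 t_1 + \cdots + \lambda_m t_m.$ 

Thus $P_\omega$ is not unimodal. ► 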

The conclusion of the theorem does not hold in general without the assumption 
that $\Theta = R^k$. Thus, for instance, the family of gamma-distributions with 
fixed shape parameter $\lambda$, and $\lambda < 1$, is regular and unimodal but not strongly 
unimodal (cf. Example 6.2). 

A number of examples of strongly unimodal exponential families have, in fact, 
already been given in Sections 6.2 and 6.3. 

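Theorem 9.21. The following conditions are equivalent. 

(i) $\mathfrak{P}$ is universal. 

(ii) $\varphi(t) \le \sup_{\theta\in\Theta}\{\theta\cdot t - \varphi^*(\theta)\}$ for $t \in \operatorname{dom}\varphi$ 

(iii) $\varphi(t) = \sup_{\theta\in\Theta}\{\theta\cdot t - \varphi^*(\theta)\}$ for $t \in \operatorname{dom}\varphi$ 

(iv) $\varphi(t) = (\operatorname{conv}\varphi)(t)$ for $t \in \operatorname{dom}\varphi$ 

(v) On $\operatorname{dom}\varphi$, $\varphi$ coincides with a convex function on $R^k$. 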

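Proof. By definition, $\mathfrak{P}$ is universal if for every $\varepsilon > 0$ and every $t \in \operatorname{dom}\varphi$ there 
exists a $\theta \in \Theta$ such that 

$(1 + \varepsilon)\,a(\theta)\,b(t)\,e^{\theta\cdot t} \ge a(\theta)\,b(\tilde{t})\,e^{\theta\cdot\tilde{t}}, \quad \tilde{t} \in R^k.$ 

This inequality may be written 

$\varphi(t) \le \theta\cdot t - (\theta\cdot\tilde{t} - \varphi(\tilde{t})) + \delta, \quad \tilde{t} \in R^k,$ 
where $\delta = \ln(1 + \varepsilon) > 0$. 

The equivalence between (i) and (ii) is now evident, and the other equivalences 
follow from the relation 

$\sup_{\theta\in\Theta}\{\theta\cdot t - \varphi^*(\theta)\} \le \varphi^{**}(t) \le (\operatorname{conv}\varphi)(t) \le \varphi(t), \quad t \in \operatorname{dom}\varphi.$ 

► 

Corollary 9.9. Let $\mathfrak{P}$ be of c-discrete type. Then $\mathfrak{P}$ is universal if and only if it is 
strongly unimodal. 

Corollary 9.10. Let $\mathfrak{P}$ be of continuous type and suppose the set $\operatorname{dom}\varphi$ is convex. Then 
$\mathfrak{P}$ is universal if and only if it is strongly unimodal. 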

Example 9.18. Let $x$ be the $r \times c$ contingency table of independent Poisson 
variates whose parameters vary freely. This model is universal and it follows at 
once from Corollaries 9.9 and 2.1 that the conditional distribution of $x$ given 
the marginals $(x_{1\cdot}, \ldots, x_{r\cdot}, x_{\cdot 1}, \ldots, x_{\cdot c})$ is strongly unimodal. 

In particular, the conditional distribution under the hypothesis of no interaction, 
which has point probability 

$\dfrac{\prod_i x_{i\cdot}!\ \prod_j x_{\cdot j}!}{x_{\cdot\cdot}!\ \prod_{i,j} x_{ij}!},$ 

is strongly unimodal. (This includes, for $c = 2$, the multivariate hypergeometric 
distribution.) ► 

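Theorem 9.22. Suppose $\mathfrak{P}$ is full and universal and that $S$ is finite. 

Then $\operatorname{range}\check{t} = S$, i.e. $\mathfrak{P}$ is strictly universal. 

Proof. Set $\bar\varphi = \operatorname{conv}\varphi$. Since $\bar\varphi(s) = \varphi(s)$ for $s \in S$, it suffices to show that to every 
$t \in S$ there exists a $\theta \in R^k$ for which 

$\theta\cdot t - \bar\varphi(t) = \sup_{u\in R^k}(\theta\cdot u - \bar\varphi(u)) \quad (= \bar\varphi^*(\theta))$ 

or, equivalently, that $\operatorname{dom}\partial\bar\varphi \supset S$. The latter relation is valid in consequence of 
Theorem 5.23. ► 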

Theorem 9.23. Suppose ^ is full and c-discrete. Let cpbea closed conv ex function on 
R* such that S = n dom S cz dom d^, and range 8(p is open, and assume that (p 
and (f) coincide on S {whence ^ is strongly unimodal and regular). 

Then range f = 5, i.e. ^ is strictly universal. 

Proof. If teS then t e dom -cp and hence for some 9 g range dtp 

9 - t — 0{t) = sup (9 ^u — ^{u)) 

which implies tG range f, since 9e® on account of (5) of Section 9.1. ► 

Example 9.19. Let the distributions in $\mathfrak{P}$, where $\mathfrak{P}$ is full, be one-dimensional and 
c-discrete and suppose that $\mathfrak{P}$ is strongly unimodal, i.e. 

(2) $b(t - 1)\,b(t + 1) \le b(t)^2, \quad t \in Z.$ 

If the inequality in (2) is strict for every $t \in S$ the suppositions in Theorem 9.23 
can be fulfilled ($\tilde\varphi$ may be chosen to be strictly convex whence $\operatorname{range}\partial\tilde\varphi$ is open). 

On the other hand, for the family of geometric distributions equality holds in 
(2), except for $t = 0$, and $\operatorname{range}\check{t} = \{0\} \ne S$. ► 
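
Condition (2) is easy to test directly. The sketch below is an added illustration with hypothetical helper names: it checks the inequality, and its strict version, for the binomial weights $b(t) = \binom{n}{t}$ and for the geometric weights $b(t) = 1$ on $\{0, 1, 2, \ldots\}$, with $b$ set to 0 outside the support.

```python
from math import comb

def binomial_b(n):
    """b(t) = C(n, t) on {0, ..., n} and 0 elsewhere."""
    return lambda t: comb(n, t) if 0 <= t <= n else 0

def geometric_b(t):
    """b(t) = 1 on {0, 1, 2, ...} and 0 elsewhere."""
    return 1 if t >= 0 else 0

def check_condition_2(b, points):
    """Return (holds, strict): b(t-1) b(t+1) <= b(t)^2, resp. < b(t)^2, at all given points."""
    holds = all(b(t - 1) * b(t + 1) <= b(t) ** 2 for t in points)
    strict = all(b(t - 1) * b(t + 1) < b(t) ** 2 for t in points)
    return holds, strict

print(check_condition_2(binomial_b(6), range(0, 7)))   # strict on the whole support
print(check_condition_2(geometric_b, range(0, 20)))    # holds, but with equality for t >= 1
```

The binomial weights satisfy the strict inequality at every support point, while the geometric weights give equality away from $t = 0$, matching the contrast drawn in the example.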




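In the case where $\mathfrak{P}$ is full and of continuous type and where $\varphi$ is a closed 
convex function, the mode mapping $\check{t}$ is clearly equal to the restriction of $\partial\varphi^*$ to 
$\Theta$; moreover, by Theorem 9.8, one has $\Theta = \operatorname{int}\operatorname{dom}\varphi^*$. In particular, $\check{t} = \partial\varphi^*$ if 
$\operatorname{dom}\partial\varphi^* = \Theta$ or, in other words, if $\operatorname{range}\partial\varphi$ is open, and then $\check{t}^{-1} = \partial\varphi$ and 
$\operatorname{range}\check{t} = \operatorname{dom}\partial\varphi \supset \operatorname{int} C$. Thus: 

Theorem 9.24. Suppose $\mathfrak{P}$ is full, of continuous type, and strongly unimodal with $\varphi$ 
closed. (Hence $\mathfrak{P}$ is regular.) 

If $\operatorname{range}\partial\varphi$ is open, which is the case in particular if $\varphi$ is essentially strictly 
convex, then $\operatorname{range}\check{t} \supset \operatorname{int} C$, i.e. $\mathfrak{P}$ is strictly universal. 

The assumption of openness of $\operatorname{range}\partial\varphi$ is essential for the conclusion that 
$\operatorname{range}\check{t} \supset \operatorname{int} C$, as is apparent from: 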

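Example 9.20. If $P_0$ is the Laplace distribution on $R$ then 

$\varphi(t) = |t| + \ln 2$ 

$\operatorname{range}\partial\varphi = [-1, 1]$ 
and 

$\Theta = (-1, 1).$ 

But 

$\operatorname{range}\check{t} = \{0\},$ 

because 

$\operatorname{range}\check{t} = \partial\varphi^*(\Theta)$ 

$= \{t: \partial\varphi(t) \cap (-1, 1) \ne \emptyset\}$ 

$= \{0\}.$ ► 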

9.6 PLAUSIBILITY FUNCTIONS FOR FULL 
EXPONENTIAL FAMILIES 

In the present section it is assumed that $\sup_t p(t; \theta) < \infty$ for every $\theta \in \Theta$, and that 
$\mathfrak{P}$ is full. The structure of the plausibility functions of $\mathfrak{P}$, including properties of 
the maximum plausibility estimator, will be studied. To some extent the 
discussion parallels that of Section 9.3. 

The log-plausibility function 

$\pi(\theta) = \theta\cdot t - \varphi^*(\theta) - \delta(\theta|\Theta)$ 

is, for any $t \in R^k$, a closed concave function on $R^k$. 

For a fixed $t$, consider the level sets 

$C_d = \{\theta: \pi(\theta) \ge d\}, \quad d \in R,$ 

of $\pi$. These convex sets are bounded if $t \in \operatorname{int} C$ (cf. Theorem 5.20). 
In particular, when the maximum plausibility estimate $\check\theta(t)$ exists it forms a 
convex set, which is bounded if $t \in \operatorname{int} C$. 

Recall that, whether $\mathfrak{P}$ is exponential or not, the range of the maximum 
plausibility estimator is, as a rule and certainly for $\mathfrak{P}$ of finite discrete type, equal 
to the whole parameter domain. 

For $\mathfrak{P}$ discrete the maximum plausibility estimate has, ordinarily, a non-empty 
interior, i.e. its dimension is $k$. However, the estimates $\check\theta(t')$ and $\check\theta(t'')$ 
corresponding to two different points $t'$ and $t''$ have at most boundary points in common. 
This follows from an application of a general argument indicated in Section 2.2. 

Thus, if $\mathfrak{P}$ is discrete, $\operatorname{int}\check\theta(t)$ is non-empty for every $t \in S$ and if $\Theta = \check\theta(S)$ then 
there exists a partition $\{A_s: s \in S\}$ of $\Theta$ such that $\operatorname{int}\check\theta(s) \subset A_s$ for every $s \in S$. 

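Adding the assumption that $\mathfrak{P}$ is universal, whence $\theta \in \check\theta(t) \Leftrightarrow t \in \check{t}(\theta)$, one obtains 
that the plausibility function may be written in the form 

$\Pi(\theta; t) = \dfrac{p(t; \theta)}{p(s; \theta)}, \quad \theta \in A_s,\ s \in S.$ 

(Compare Figure 2.1.) 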

Theorem 9.25. If $\mathfrak{P}$ is of finite discrete type then the maximum plausibility estimate 
$\check\theta(t)$ exists (and is a convex set) for every $t \in S$, and $\check\theta(S) = \Theta = R^k$. 

Proof. Since $S$ is finite, $\pi$ is a polyhedral concave function. Moreover, $\pi$ is bounded 
above and hence it attains its supremum, cf. Rockafellar (1970), Corollary 27.3.2. 

► 

Example 9.21. Suppose $\mathfrak{P}$ is discrete with $S \subset Z$, and strongly unimodal, i.e. $S$ is a 
set of consecutive integers and 

(1) $b(t - 1)\,b(t + 1) \le b(t)^2, \quad t \in Z.$ 

In the case $\mathfrak{P}$ is the family of geometric distributions, $\check\theta(0) = \Theta$ and $\check\theta(t) = \emptyset$ for 
$t \in S \setminus \{0\}$ $(= \{1, 2, \ldots\})$. 

However, if the inequality in (1) is strict then $b(t - 1)/b(t)$ is strictly increasing 
on $S + \{0, 1\}$ and 

$\check\theta(t) = \left[\ln\dfrac{b(t - 1)}{b(t)},\ \ln\dfrac{b(t)}{b(t + 1)}\right], \quad t \in S,$ 

(where this interval is to be interpreted respectively as $(-\infty, \ln\{b(t)/b(t + 1)\}]$ or 
$[\ln\{b(t - 1)/b(t)\}, \infty)$ if $t - 1 \notin S$ or $t + 1 \notin S$). Furthermore, $\check\theta(S) = \Theta$. To show 
this one may, for instance, use Theorems 6.2 and 6.1. ► 
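
The interval formula of Example 9.21 can be tried out on the binomial weights $b(t) = \binom{n}{t}$, which satisfy the strict inequality. The sketch below is an added illustration with hypothetical function names; it reads the estimate for $t$ as the interval $[\ln\{b(t-1)/b(t)\}, \ln\{b(t)/b(t+1)\}]$, with infinite endpoints at the two ends of $S$, and it confirms numerically that $t$ is the mode of $b(s)e^{\theta s}$ for a value of $\theta$ inside the interval.

```python
import math
from math import comb

def plausibility_interval(b, t, support):
    """Interval [ln b(t-1)/b(t), ln b(t)/b(t+1)], with infinite endpoints at the ends of S."""
    lo = math.log(b(t - 1) / b(t)) if (t - 1) in support else -math.inf
    hi = math.log(b(t) / b(t + 1)) if (t + 1) in support else math.inf
    return lo, hi

def mode(b, theta, support):
    """Mode point of the probability function proportional to b(s) e^{theta s}."""
    return max(support, key=lambda s: math.log(b(s)) + theta * s)

n = 4
support = range(n + 1)
b = lambda s: comb(n, s)
intervals = {t: plausibility_interval(b, t, support) for t in support}
for t in support:
    print(t, intervals[t])
```

Adjacent intervals share exactly one endpoint, so together they cover the whole real line, in agreement with the statement that the estimates for different points have at most boundary points in common.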


Incidentally, the results mentioned in Examples 4.18 and 4.19 are extendable 
to distribution families of the type considered in the above example. 

Theorem 9.26. Suppose $\mathfrak{P}$ is of continuous type and that $\varphi$ is a closed and strictly 
convex function which is differentiable on int $C$. 

Then $\check\theta(t) = D\varphi(t)$ for $t \in \operatorname{int} C$. 
Proof. By the remark following formula (2) of Section 5.4, the function 

(2) $\theta\cdot t - \varphi^*(\theta)$ 

has a maximum at a $\theta \in R^k$ if and only if $\theta \in \partial\varphi^{**}(t)$. But here $\varphi^{**} = \varphi$ and, 
provided $t \in \operatorname{int} C$, one has $\partial\varphi(t) = D\varphi(t)$. Moreover, on account of Theorem 9.7, 
$\operatorname{range}\partial\varphi = \operatorname{dom}\partial\varphi^* = \Theta$ and hence $D\varphi(t) \in \Theta$. Since $D\varphi(t)$ is a maximum point 
of (2), it also maximizes $\pi(\theta) = \theta\cdot t - \varphi^*(\theta) - \delta(\theta|\Theta)$. ► 

Finally, for the continuous-type case, a set of requirements can now be 
stipulated which are sufficient to ensure that the conditions (iv) and (v) of 
Theorem 4.7 are fulfilled when f, f, ij/ and i^jof those conditions are specified, in the 
present context, as and where these latter variables are 

components of similar partitions of f, f, 0, and 0. 

Theorem 927. Let be of continuous type.- 
If (p is closed and strictly convex, and differentiable on int C, and if dom cp = ini C 
then 6 =: Dtp = and 0 is a homeomorphism on int C onto 0. Moreover, and 
are variation independent, and and 0^^^ are variation independent. 

Proof. By Theorem 9.22, $ is universal and hence 0 = Theorem 9.8 shows 
that 0 = int dom and therefore a maximum of 7r(0) is also a maximum of 
0 • f — tp*(9). This latter function has a maximum only if f eint C and it now 
follows from Theorem 9.26 that 0 = Dtp. The remaining conclusions may be 
obtained from Theorems 5.33 and 5.34. 


9.7 PREDICTION FUNCTIONS FOR FULL 
EXPONENTIAL FAMILIES 

Suppose an observation t, with distribution 

a{9)b{t)Q^’\ 

has been taken, and that it is desired to make inference on the unknown outcome 
u of another, independent experiment for which u is assumed to follow an 
exponential family which is also of order k and has a minimal representation 

aH9)d(u)e^‘\ 

The support and convex support for u will be denoted by and while the 
variation domain and the value of the parameter 0 are supposed to be the same 
for t and u. In typical situations where this is the case, t and u are both minimal 
canonical statistics based on independent samples and x\,...,.xl 

respectively, the common distribution of the xs being from an exponential 
family (cf. Section 8.2(ii)). 

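Furthermore, let the distributions of $t$ and $u$ be of finite discrete or c-discrete or
continuous type, and consider the likelihood prediction function

$$L(u\,|\,t) = \sup_{\theta}\; a(\theta)\,b(t)\,e^{\theta\cdot t}\,\frac{d(u)\,e^{\theta\cdot u}}{\sup_{u}\{d(u)\,e^{\theta\cdot u}\}}$$

and the plausibility prediction function

$$\Pi(u\,|\,t) = \sup_{\theta}\; \frac{b(t)\,e^{\theta\cdot t}}{\sup_{t}\{b(t)\,e^{\theta\cdot t}\}}\;\frac{d(u)\,e^{\theta\cdot u}}{\sup_{u}\{d(u)\,e^{\theta\cdot u}\}}.$$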

With the notations $l = \ln L$, $\pi = \ln \Pi$, $\kappa = -\ln a$, $\varphi = -\ln b$ and $\psi = -\ln d$ one
has

(1)   $l(u\,|\,t) = \sup_{\theta\in\Theta}\{\theta\cdot(t+u) - (\kappa+\psi^*)(\theta)\} - \varphi(t) - \psi(u)$

and

(2)   $\pi(u\,|\,t) = \sup_{\theta\in\Theta}\{\theta\cdot(t+u) - (\varphi^*+\psi^*)(\theta)\} - \varphi(t) - \psi(u)$.

Obviously, $\hat\theta$ and $\check\theta$ (the sets of points $\theta\in\Theta$ for which the suprema in (1) and (2),
respectively, are attained) depend on $t$ and $u$ only through $t + u$.

Finally, assume that the distribution family of $u$, as well as that of $t$, is full. Then,
in view of Theorem 9.2(ii), the suprema over $\Theta$, in (1) and (2), equal the suprema
over all of $R^k$, and (1) and (2) may be written

$$l(u\,|\,t) = (\kappa+\psi^*)^*(t+u) - \varphi(t) - \psi(u) = (\hat l \,\square\, \psi^{**})(t+u) - \varphi(t) - \psi(u)$$

where $\hat l = \kappa^*$ (cf. Theorem 9.1), and

$$\pi(u\,|\,t) = (\varphi^*+\psi^*)^*(t+u) - \varphi(t) - \psi(u) = (\varphi^{**} \,\square\, \psi^{**})(t+u) - \varphi(t) - \psi(u).$$

In the discussion of $l$ and $\pi$ below the plausibility prediction function $\Pi$, which is
symmetrical in relation to the performed experiment and the predicted experiment,
will be treated first.



Theorem 9.28. Let the distribution families of $t$ and $u$ be of finite discrete type and
universal.

If $t + u \in \operatorname{range}(\check t + \check u)$ then

(3)   $\pi(u\,|\,t) = \varphi(\check t) + \psi(\check u) - \varphi(t) - \psi(u)$,

i.e.

(4)   $\Pi(u\,|\,t) = \dfrac{b(t)\,d(u)}{b(\check t)\,d(\check u)}$,

where, in (3) and (4), $\check t$ and $\check u$ denote mode points corresponding to a common value of
$\theta$ and determined so that $\check t + \check u = t + u$.

For any value $t$ the maximum plausibility predictate of $u$ exists and is given by

(5)   $\check u(t) = \check u(\check\theta(t))$,

and

$\Pi(\check u\,|\,t) = 1$.

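Proof. The functions $\bar\varphi = \operatorname{conv}\varphi$ and $\bar\psi = \operatorname{conv}\psi$ are closed, and they coincide
with $\varphi$ and $\psi$ on $\{b(\cdot) > 0\}$ and $\{d(\cdot) > 0\}$, respectively, cf. Theorem 9.21. If $t \in \check t(\theta)$
and $u \in \check u(\theta)$ then $t \in \partial\bar\varphi^*(\theta)$ and $u \in \partial\bar\psi^*(\theta)$, and the first assertion of the theorem
now follows from Corollary 5.2.

As discussed in Section 3.2, one always has $\check u(\check\theta(t)) \subset \check u(t)$. Thus it remains to
show that $\Pi(u\,|\,t) = 1$ implies $u \in \check u(\check\theta(t))$.

Clearly, $\operatorname{int}(C + \tilde C) \subset \operatorname{dom}\partial(\bar\varphi \,\square\, \bar\psi) \subset C + \tilde C$, and in fact
$\operatorname{dom}\partial(\bar\varphi \,\square\, \bar\psi) = C + \tilde C$ because $\bar\varphi \,\square\, \bar\psi$ is a polyhedral convex function, cf.
Rockafellar (1970), Corollaries 19.1.2 and 19.3.4 and Theorem 23.10. From (2) one finds

$$\check\theta(t+u) = \partial(\varphi^*+\psi^*)^*(t+u) = \partial(\bar\varphi \,\square\, \bar\psi)(t+u)$$

and so, for $t \in S$, $u \in \tilde S$ and $\theta \in \check\theta(t+u)$,

$\Pi(u\,|\,t) = \Pi(t;\theta)\,\Pi(u;\theta) = 1$
   $\Rightarrow$ $t \in \check t(\theta)$ and $u \in \check u(\theta)$
   $\Rightarrow$ $u \in \check u(\check\theta(t))$. ►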

When the distributions of $t$ and $u$ are of c-discrete or continuous type a similar
line of reasoning holds under mild regularity assumptions. Thus, in the c-discrete
case, if besides universality it is assumed that $\bar\varphi = \operatorname{conv}\varphi$ and $\bar\psi = \operatorname{conv}\psi$ are
closed then the first assertion of Theorem 9.28 stands again, and if, moreover,
range $\partial\bar\varphi$ and range $\partial\bar\psi$ are open then (5) is true, at least for $t \in \operatorname{int} C$ (but
non-emptiness of $\check u(\check\theta(t))$ is not ensured).

As usual, let $\tau = E_\theta t$. By the same technique as that used for the proof of
Theorem 9.28 one obtains:

Theorem 9.29. Let the distribution family of $t$ be steep, and let the distribution family
of $u$ be of finite discrete type and universal.
If $t + u \in \operatorname{range}(\tau + \check u)$ then

(6)   $l(u\,|\,t) = \hat l(\tau) + \psi(\check u) - \varphi(t) - \psi(u)$

i.e.

(7)   $L(u\,|\,t) = a(\theta)\,b(t)\,e^{\theta\cdot\tau}\,\dfrac{d(u)}{d(\check u)}$

where, in (6) and (7), $\tau = \tau(\theta)$ and $\check u \in \check u(\theta)$, the quantities $\tau$ and $\check u$ being determined
such that $\tau + \check u = t + u$.

For any value $t \in \operatorname{int} C$ the maximum likelihood predictate exists and is given by

$\hat u(t) = \check u(\hat\theta(t))$.

For binomial variates $t$ and $u$, Theorems 9.28 and 9.29 are illustrated by
Example 3.1.
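
The plausibility prediction function can also be evaluated by brute force in the binomial case. The following is only a sketch under stated assumptions: $t$ is taken as binomial with $m$ trials and $u$ as binomial with $n$ trials and the same success probability, the supremum over the parameter is approximated on a grid, and the function names are hypothetical. It compares the brute-force value with the closed form $b(t)d(u)/\{b(\check t)d(\check u)\}$, where $b$ and $d$ are the binomial coefficients.

```python
from math import comb

def binom_pmf(x, size, p):
    """Binomial point probability."""
    return comb(size, x) * p**x * (1 - p)**(size - x)

def plausibility_prediction(u, t, m, n, grid=2001):
    """Approximate Pi(u|t): the supremum over p of the product of the two
    normed probabilities (each point probability divided by its modal value)."""
    best = 0.0
    for i in range(1, grid):
        p = i / grid
        max_t = max(binom_pmf(x, m, p) for x in range(m + 1))
        max_u = max(binom_pmf(x, n, p) for x in range(n + 1))
        value = (binom_pmf(t, m, p) / max_t) * (binom_pmf(u, n, p) / max_u)
        best = max(best, value)
    return best

# Assumed illustration: t = 3 successes in m = 10 trials, u predicted out of n = 5.
m, n, t = 10, 5, 3
profile = [plausibility_prediction(u, t, m, n) for u in range(n + 1)]

# Closed form for u = 0: mode points (2, 1) share a common p and sum to t + u = 3.
closed_form_u0 = comb(m, 3) * comb(n, 0) / (comb(m, 2) * comb(n, 1))
```

In this illustration the profile attains the value 1 at the predictates whose mode set overlaps that of the observed $t$, and the grid value for $u = 0$ agrees with the closed form up to rounding.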

Example 9.22. Suppose t is the number of successes in a fixed number m of 
independent Bernoulli trials, while u is the number of trials required to obtain a 
further $n$ successes. Let $\pi$ denote the success probability.

If 0 < f < 01 and n > 1, the above conditions are satisfied and 



(a = n, n + 1,.,.) 


where t = [mn}, u = n + [(o — l)7r/(l — tt)} and tz is determined so that 
t + u — t + u. Furthermore, u(t) = u(d{t)). p. 


9.8 COMPLEMENTS 

(i) In relation to Theorem 9.3, the reader may wonder whether $\mathfrak{T}$ is convex for
non-steep $\kappa$. Although this may happen it is not to be expected generally. To see
that, note first that range $\partial\kappa$ = dom $\partial\hat l$ = int C. Hence, by Theorem 5.25,
$(\operatorname{int} C)\setminus\mathfrak{T} = \partial\kappa(\operatorname{bd}\Theta)$. Invoking Theorem 5.26 one finds that $(\operatorname{int} C)\setminus\mathfrak{T}$ is a union
of half-lines with start point in bd $\mathfrak{T}$, which makes it somewhat unlikely to have
$\mathfrak{T}$ convex (provided, of course, that $\kappa$ is non-steep).

Example 9.23. Take $\mathfrak{X} = R^2$ and $t$ equal to the identity mapping on $R^2$. Let $P_0$ be
the probability measure having support $\{(t_1,0): t_1 > 0\} \cup \{(0,t_2): t_2 \ge 0\}$, giving
measure ½ to each of the two half-axes and being such that the truncation of $P_0$ to
$\{(t_1,0): t_1 > 0\}$ is the exponential distribution with density $e^{-t_1}$, while the
truncation to {(0, £ 2 ): £2 ^ 0} has density proportional to (1 -I- rl)"*. Defining ip 
as the exponential family generated by Pq and £ one has 

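$\Theta = \{\theta: \theta_1 < 1,\ \theta_2 \le 0\}$, $C = [0, \infty)^2$

and

$$\tau(\theta) = \left(\frac{c_1(\theta_1)\,\kappa_1'(\theta_1)}{c_1(\theta_1)+c_2(\theta_2)},\ \frac{c_2(\theta_2)\,\kappa_2'(\theta_2)}{c_1(\theta_1)+c_2(\theta_2)}\right),$$

where $c_i$ denotes the Laplace transform of the truncation of $P_0$ to axis $i$, and
$\kappa_i = \ln c_i$. Now,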


and using this it is simple to see that X is the region bounded by the two positive 
half-axes and the curve 


(AV(A + 1), k' 2 ( 0 )/(A +1)), 0 < 1 < 00 . 

This region is not convex. 

This example is due to Bradley Efron.

►

(ii) The following example shows that the inclusion sign in Theorem 9.4 cannot
in general be replaced by an equality.

Example 9.24. Let $k = 2$ and let $a_1, a_2, \dots$ be a strictly increasing sequence of real
numbers with $a_1 = 0$ and $a_n \to \infty$. Furthermore, let $P_0$ be the probability measure
on $R^2$ defined by

$P_0\{(-1, 0)\} = \tfrac12$,   $P_0\{(a_n, 1)\} = 2^{-n-1}$,   $n = 1, 2, \dots$

and let $\mathfrak{P}$ be the canonical exponential family generated by $P_0$ and the identity
mapping $t = (t_1, t_2)$ on $R^2$. We have

convS = {(-1,0)} u {(ti,t 2 ): - 1 < ti < 0, 0 < < 1 + tj 

and ord $\mathfrak{P}$ = 2.

It will now be shown that, provided $a_n$ increases sufficiently fast with $n$,
the point $t = 0 = (0,0)$ belongs to dom $\hat l$ even though $0 \notin \operatorname{conv} S$. Note that

$p(0) = \inf_{e} P_0\{e\cdot t \ge 0\} = 0$

so that the relation $0 \in \operatorname{dom}\hat l$ cannot be obtained by invoking Lemma 9.1.

We have

$\hat l(0) = \sup_{\theta}\,(-\kappa(\theta)) = \sup_{e}\,\sup_{\lambda>0}\,(-\kappa(\lambda e))$

and thus it must be shown that, with an appropriate choice of $\{a_n\}$,

(1)   $\inf_{e}\,\inf_{\lambda>0} \int e^{\lambda e\cdot t}\,dP_0 > 0$.

Suppose (1) does not hold. Then there exists a sequence of non-negative real
numbers $\lambda_1, \lambda_2, \dots$ and a sequence of unit vectors $e_1, e_2, \dots$ such that

(2)   $\int e^{\lambda_i e_i\cdot t}\,dP_0 \to 0$ as $i \to \infty$.

The integral in (2) is bounded below by $P_0\{e_i\cdot t \ge 0\}$ and this probability is
greater than or equal to ½ unless $e_{i1} > 0$ and $e_{i2} < 0$ where $e_{i1}$ and $e_{i2}$ are the
coordinates of $e_i$. Hence one may assume

$e_{i1} > 0$,   $e_{i2} < 0$,   $i = 1, 2, \dots$.

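The sequence $\lambda_i$ cannot be bounded from above; if, namely, $\lambda_i < \lambda$ for all $i$ and
some $\lambda < \infty$ then

$$\int e^{\lambda_i e_i\cdot t}\,dP_0 \;\ge\; \int_{\{e_i\cdot t<0\}} e^{\lambda_i e_i\cdot t}\,dP_0 \;\ge\; e^{-\lambda_i e_{i1}}\,P_0\{(-1,0)\} \;\ge\; \tfrac12 e^{-\lambda} > 0$$

in contradiction to (2). It may therefore also be assumed that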

1 = 1 , 2 ,.... 

Let Hi denote the smallest n with Cj ■(«„, 1) > 0. Then 
(3) J e^'^'-'dPo ^ 

In view of (2) and the fact that en > 0 and ^ > 0 one can conclude that oo 
as i 00 . Furthermore 1) > e,- + 1) where e, is the unit vector in 

the fourth quadrant which is orthogonal to (a„ , 1), i.e. 

e,=«+ 1) 

Hence 


Setting, for instance, 


one has 


e.-(a„ + i,l) > 


^/K + 1 ) ■ 


2”^ 


« = 1 , 2 ,... 


and consequently the lower bound in (3) tends to $\infty$ as $i \to \infty$. This contradicts
(2), and thus $0 \in \operatorname{dom}\hat l$.


(iii) The first assertion of Theorem 9.13 can be extended as follows, to cover the
case of exponential representations

$$a(\theta)\,b(t)\,e^{\theta\cdot t}$$

with $\Theta = \{\theta: c(\theta) < \infty\}$, but which are not necessarily minimal.


Theorem 9.30. Suppose $\Theta = \{\theta: c(\theta) < \infty\}$ and int $\Theta \ne \emptyset$.

Then the log-likelihood function has a maximum if and only if $t \in \operatorname{ri} C$.

Proof Set L = aff S, let denote the dimension of L and let to be a point in L. 
Furthermore, let M be an orthonormal matrix whose first l^^^ columns are vectors 
in the subspace Lq = L — to, set 

r=(t- to)M 
6 = 9M 

and partition $\tilde t$ and $\tilde\theta$ into $(\tilde t^{(1)}, \tilde t^{(2)})$ and $(\tilde\theta^{(1)}, \tilde\theta^{(2)})$ where $\tilde t^{(1)}$ and $\tilde\theta^{(1)}$ have dimension
$k^{(1)}$. Now, for any $\theta \in R^k$,

Je®-'iFo£ = e«-'“ Je«-'dPof 

which shows that 

(4) X 
and 

(5) k( 9) - 6' to = 

where is the domain of the Laplace transform Cj of and Ki = Inci. 
Note that, with ai{9^^^) = the expression 

( 6 ) §(»€©<!», 

is a (minimal) representation of the densities of ^ with respect to Pq. 

By (5) one has 

(7) l(9) = 9^t-K(9) 

9 ‘it - to) - Ki{9^^^) 

whence, in view of (4), one finds that / has a maximum if and only if = 0 and 

( 8 ) /#'>) = 

has a maximum. 

The function li given by (8) is the likelihood function for Sp corresponding to 
the representation (6). The assumption int©^ 0 implies int©^^^^0. 
Moreover the affine support of the marginal distributions of has 
dimension Hence, according to Theorem 9.13, li has a maximum if and only if 

belongs to the interior of the convex support of 
The latter condition together with the requirement = 0 is equivalent to 
teriC. ► 

(iv) Partially observed exponential situations. If a variate $x$ follows an exponential
model $\mathfrak{P}$ with minimal representation

$$a(\theta)\,b(x)\,e^{\theta\cdot t(x)}$$

but only the value of a statistic $u$, and not $x$ itself, is observed then the likelihood
equation for $\theta$ is

(9)   $E_\theta t = E_\theta(t\,|\,u)$

while the Fisher information matrix may be written as

(10)   $i(\theta) = V_\theta E_\theta(t\,|\,u) = V_\theta t - E_\theta V_\theta(t\,|\,u)$.

This follows at once from formula (6) of Section 8.2. (The expressions (9) and (10)
were noted by Per Martin-Löf in 1966. The asymptotic likelihood theory for the
present type of situation is treated by Sundberg (1974) who also provides a
copious list of examples of such situations.)

Suppose now that an affine submodel $\mathfrak{P}_0 = \{P_\theta: \theta\in\Theta_0\}$ is considered. The
likelihood equation may then be written

(11)   $E_\theta t_0 = E_\theta(t_0\,|\,u)$   $(\theta\in\Theta_0)$

where $t_0$ denotes a minimal canonical statistic under $\mathfrak{P}_0$.

Example 9.25. ABO blood group system. The table below indicates the observed
and theoretical distributions according to genotype at the ABO-locus for $n$
persons sampled from a population in which the frequencies of the A, B and O
genes are, respectively, $p$, $q$ and $r$.

    Genotype       AA     AO     BB     BO     AB     OO
    Observed       x_1    x_2    x_3    x_4    x_5    x_6
    Theoretical    p^2    2pr    q^2    2qr    2pq    r^2

Since A and B are both dominant relative to O, only $x_1 + x_2$, $x_3 + x_4$, $x_5$ and $x_6$
are phenotypically observable. Taking the full multinomial model for $(x_1, \dots, x_6)$
as the original model and letting $\mathfrak{P}_0$ correspond to the hypothesis of
Hardy-Weinberg distribution, as in the table, one finds from (11) with
$u = (x_1 + x_2,\ x_3 + x_4,\ x_5)$ and $t_0 = (2x_1 + x_2 + x_5,\ 2x_3 + x_4 + x_5)$ that the
likelihood equations are

$$2np = x_1 + x_2 + x_5 + (x_1 + x_2)\,\frac{p}{p + 2r}$$

$$2nq = x_3 + x_4 + x_5 + (x_3 + x_4)\,\frac{q}{q + 2r}$$

►


The expression (11) was, for frequency table models and u being a vector of 
sums over various cells, derived by Haberman (1974) who illustrated it with 
various genetical examples including the one given here. Clearly, many models 
for frequency tables for which some of the observations are only partly 
categorized fall within the present framework. 
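
One simple way to solve likelihood equations of the form (11) numerically is to iterate them as a fixed-point map. The sketch below does this for the ABO equations of Example 9.25, written as $2np = n_A + n_{AB} + n_A\,p/(p+2r)$ and $2nq = n_B + n_{AB} + n_B\,q/(q+2r)$ with $r = 1 - p - q$. The iteration scheme, the starting value and the function name are assumptions made for illustration (the text does not prescribe a solution method), and the phenotype counts are artificial, chosen to equal the Hardy-Weinberg expectations for $p = 0.3$, $q = 0.1$.

```python
def abo_gene_frequencies(n_a, n_b, n_ab, n_o, iterations=200):
    """Iterate the two likelihood equations as a fixed-point map.

    n_a, n_b, n_ab, n_o are the observed phenotype counts
    (x1 + x2, x3 + x4, x5 and x6 in the notation of the example).
    """
    n = n_a + n_b + n_ab + n_o
    p, q = 1 / 3, 1 / 3                      # arbitrary interior starting point
    for _ in range(iterations):
        r = 1 - p - q
        # Right-hand sides: conditional mean of the gene counts given phenotypes.
        p_new = (n_a + n_ab + n_a * p / (p + 2 * r)) / (2 * n)
        q_new = (n_b + n_ab + n_b * q / (q + 2 * r)) / (2 * n)
        p, q = p_new, q_new
    return p, q, 1 - p - q

# Artificial counts: n = 10000 persons with exact Hardy-Weinberg proportions.
p_hat, q_hat, r_hat = abo_gene_frequencies(4500, 1300, 600, 3600)
```

With these counts the iteration settles at the generating frequencies, as it should when the observed table fits the hypothesis exactly.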

(v) Spread-stabilizing and normalizing transformations. Transformations of argument
variables of lods functions which normalize, or stabilize the spread of, the
functions were touched upon in Section 3.4(i). Here the question of transformations
of this kind will be further discussed in the one-dimensional case,
primarily for log-likelihood and log-probability functions of regular exponential
families of order 1.

Consider first the log-likelihood function of such a family,

$$l(\theta) = \theta t - \kappa(\theta),$$

and set

(12)   $\zeta(\theta) = \int \kappa''(\theta)^{1/2}\,d\theta$

and

(13)   $\nu(\theta) = \int \kappa''(\theta)^{1/3}\,d\theta$

(the integrals being taken as indefinite). The transformation $\zeta$ stabilizes the
spread of the log-likelihood function in the sense that

$\dfrac{d^2 l}{d\zeta^2}(\hat\zeta)$ is the same for all $t \in \operatorname{int} C$,

and the transformation $\nu$ is normalizing in the sense that

$\dfrac{d^3 l}{d\nu^3}(\hat\nu) = 0$ for all $t \in \operatorname{int} C$.

This proposition may of course be verified by a direct check. It is however
instructive to give the subsequent line of reasoning which leads to the result and
shows that the transformations $\zeta$ and $\nu$ are essentially unique.

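Let $\alpha(\theta)$ be a one-to-one, smooth transformation of $\theta$. Differentiating $l$ with
respect to $\alpha$ and inserting $\theta = \hat\theta$ ($= \hat\theta(t)$), where $t \in \operatorname{int} C$, so that $l'(\hat\theta) = 0$, one obtains

$$\frac{d^2 l}{d\alpha^2}(\hat\alpha) = -\frac{\kappa''(\hat\theta)}{\alpha'(\hat\theta)^2}, \qquad \frac{d^3 l}{d\alpha^3}(\hat\alpha) = -\frac{\kappa'''(\hat\theta) - 3\kappa''(\hat\theta)\,\alpha''(\hat\theta)/\alpha'(\hat\theta)}{\alpha'(\hat\theta)^3}.$$

Requiring these expressions to be, respectively, constant and 0 identically for
$t \in \operatorname{int} C$ (or, equivalently, for $\theta \in \Theta$) one obtains differential equations for $\alpha$,
which are solved by (12) and (13).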
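
The direct check mentioned above can be carried out numerically. The sketch below assumes the Poisson family as a test case, with $\kappa(\theta) = e^\theta$, and reads (12) and (13) as $\zeta = \int \kappa''^{1/2}\,d\theta = 2e^{\theta/2}$ and $\nu = \int \kappa''^{1/3}\,d\theta = 3e^{\theta/3}$. The finite-difference helpers and all function names are hypothetical and chosen only for illustration.

```python
from math import exp, log

def loglik(theta, t):
    """Poisson log-likelihood in the canonical parameter, constants dropped."""
    return theta * t - exp(theta)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def third_derivative(f, x, h=1e-2):
    """Central finite-difference approximation of f'''(x)."""
    return (f(x + 2 * h) - 2 * f(x + h) + 2 * f(x - h) - f(x - 2 * h)) / (2 * h**3)

def curvature_in_zeta(t):
    """d^2 l / d zeta^2 at the maximum, where theta = 2 log(zeta / 2)."""
    zeta_hat = 2 * t**0.5                    # zeta at theta_hat = log t
    return second_derivative(lambda z: loglik(2 * log(z / 2), t), zeta_hat)

def third_order_in_nu(t):
    """d^3 l / d nu^3 at the maximum, where theta = 3 log(nu / 3)."""
    nu_hat = 3 * t ** (1 / 3)                # nu at theta_hat = log t
    return third_derivative(lambda v: loglik(3 * log(v / 3), t), nu_hat)

observations = (0.5, 2.0, 10.0, 50.0)
curvatures = [curvature_in_zeta(t) for t in observations]
third_orders = [third_order_in_nu(t) for t in observations]
```

For this family the curvature in $\zeta$ comes out as the same constant for every $t$, and the third derivative in $\nu$ vanishes up to discretization error.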

It is convenient to have the transformations expressed in terms of the mean
value parameter $\tau$ as well as in terms of $\theta$. Writing $V_\theta$ and $V_\tau$ for the variance of $t$
considered as a function of respectively $\theta$ and $\tau$, one sees that the appropriate
transformations of these variables have the form

(14)   $\zeta(\theta) = \int V_\theta^{1/2}\,d\theta$,   $\zeta(\tau) = \int V_\tau^{-1/2}\,d\tau$

(15)   $\nu(\theta) = \int V_\theta^{1/3}\,d\theta$,   $\nu(\tau) = \int V_\tau^{-2/3}\,d\tau$.

A familiar technique for finding a variance-stabilizing or normalizing transformation
of the arithmetic mean $\bar t = (t_1 + \dots + t_n)/n$ of $n$ independent and
identically distributed variates $t_i$, $i = 1, \dots, n$, proceeds as follows. Let $\tau$ denote
the mean value of the $t_i$ and suppose that $\tau$ parametrizes the family of
distributions of $t_i$, so that the central moments $\mu_2, \mu_3, \dots$ of the $t_i$ are expressible
as functions of $\tau$ alone. Furthermore, let $\alpha(\tau)$ denote the transformation sought
for. From the Taylor series of $\alpha(\bar t)$ around $\tau$ one obtains the first order terms of the
expansions in powers of $n^{-1}$ of the second and third central moments of $\alpha(\bar t)$.
Requiring the first order term for the second central moment to be constant yields
the differential equation $\alpha' = \mu_2^{-1/2}$ for a variance-stabilizing transformation $\alpha$, and
setting the first order term for the third central moment equal to 0 gives the
differential equation $\alpha''/\alpha' = -\mu_3/(3\mu_2^2)$ for a transformation $\alpha$ which can be
expected to be normalizing since it makes the skewness of $\alpha(\bar t)$ approximately 0.

If the $t_i$ follow a linear and regular exponential family, one has $\tau = \kappa'(\theta)$,
$\mu_2 = \kappa''(\theta) = V_\tau$ and $\mu_3 = \kappa'''(\theta)$ and hence the solutions of the two differential
equations may be written


(16) 

at) = 1' v;Ux 

(17) 

v(i) = |V;^dT. 


That the spread-stabilizing and normalizing transformations for exponential
models might be given on the form (14)-(17) was first noticed by R. W. M.
Wedderburn.


Example 9.26. For the binomial distribution 

j'^|7r'(l-7r)'"-‘ 
the transformations turn out as 


= J {7c(l — 7i)} ^ d% — arc sin 
v(7r) = J {7c(l — n)Y^d% 



generally shows:

Theorem 9.31. Let $X$ be an open interval, let $f$ be a strictly convex and three times
differentiable function on $X$ and define the function $l(x) = l(x; x^*)$, for
$x^* \in X^* = Df(X)$, by

$$l(x) = x^* \cdot x - f(x).$$

Setting

(20)   $\zeta(x) = \int f''(x)^{1/2}\,dx$

and

(21)   $\nu(x) = \int f''(x)^{1/3}\,dx$

and letting $\hat x$ denote the solution of the equation $x^* = Df(x)$, one has (for $\hat\zeta = \zeta(\hat x)$
and $\hat\nu = \nu(\hat x)$)

$\dfrac{d^2 l}{d\zeta^2}(\hat\zeta)$ is the same for all $x^* \in X^*$

and

$\dfrac{d^3 l}{d\nu^3}(\hat\nu) = 0$ for all $x^* \in X^*$. ►

The theorem should be thought of as a result on spread-stabilization and 
normalization of linear, concave lods functions. 

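It may be convenient to work with $l$ as a function of $x^* = Df(x)$ rather than $x$.
In terms of $x^*$ the transformations are

$$\zeta(x^*) = \int f^{*\prime\prime}(x^*)^{1/2}\,dx^*$$

and

$$\nu(x^*) = \int f^{*\prime\prime}(x^*)^{2/3}\,dx^*$$

where $f^*$ is the conjugate of $f$.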

For the log-probability function of an exponential family,

$$m(t) = \theta t - \varphi(t),$$

Theorem 9.31 applies if $\varphi$ coincides on $S$ with a function $\bar\varphi$ having the properties
required of $f$ in the theorem (in which case the exponential family is strongly
unimodal).

Example 9.29. Considering again the binomial distribution one finds, for
$\varphi(t) = \ln\{\Gamma(t+1)\,\Gamma(n-t+1)\}$,

(22)   $\zeta(t) = \int \{\psi'(t+1) + \psi'(n-t+1)\}^{1/2}\,dt$

(23)   $\nu(t) = \int \{\psi'(t+1) + \psi'(n-t+1)\}^{1/3}\,dt$

where $\psi$ denotes the digamma function. The asymptotic expansion

$$\psi'(x) \sim x^{-1} + \tfrac12 x^{-2} + \cdots \qquad (x \to \infty)$$

implies

$$\psi'(t+1) + \psi'(n-t+1) \approx n\{t(n-t)\}^{-1}$$

and with this approximation formulas (22) and (23) change to (18) and (19). ►

Example 9.30. For the gamma distribution with known shape parameter $\lambda > 1$
one has $\varphi(t) = -(\lambda - 1)\ln t$ and formulas (20) and (21) yield the familiar
transformations $\zeta(t) = \ln t$ and $\nu(t) = t^{1/3}$ which were obtained from a different
angle in Example 9.28.

(vi) Infinitesimal L-independence and orthogonality. In this subsection © is 
assumed to be open. 

Let $\{P_\omega: \omega \in \Omega\}$ be a parametrization of $\mathfrak{P}$ such that $\Omega$ is an open subset of $R^k$
and the (one-to-one) mapping $\omega \to \theta$ has continuous partial derivatives of the
second order. Using formula (3) of Section 1.1 one finds that the log-likelihood
function $l$ satisfies


dH ^ 39 d& d^e 

dco'do) do/ d(X>~^ ^ ^ dco'day ’ 

Hence, if tGT{&) then (by Corollary 9.6) one has the useful equality 

Considering now a partition $(\omega^{(1)}, \dots, \omega^{(m)})$ of $\omega$ one sees, from (24), that
$\omega^{(1)}, \dots, \omega^{(m)}$ are orthogonal under $P_\omega$ if and only if $\omega^{(1)}, \dots, \omega^{(m)}$ are
infinitesimally L-independent at $t$, provided only that $t \in \tau(\Theta)$. Thus, on the
assumption that $\mathfrak{P}$ is full (and hence regular), $\omega^{(1)}, \dots, \omega^{(m)}$ are orthogonal under
$P_\omega$ for every $\omega \in \Omega$ if and only if $\omega^{(1)}, \dots, \omega^{(m)}$ are infinitesimally L-independent
for every $t$ for which the log-likelihood function has a maximum.

Next, the notions of orthogonality and infinitesimal L-independence will be
considered in relation to canonical, mean value, and mixed parametrizations of
the exponential family $\mathfrak{P}$. Let $(\theta^{(1)}, \dots, \theta^{(m)})$, $(\tau^{(1)}, \dots, \tau^{(m)})$ and $(t^{(1)}, \dots, t^{(m)})$ be
similar partitions of $\theta$, $\tau$, and $t$.

The Fisher information matrix for 6 is and that for t is From this 

and Theorem 9.1 1 it follows that 


\ . . . , are orthogonal 
<=> \ , 0 ^"*^ are orthogonal 

are uncorrelated 
are independent. 


Thus orthogonality and infinitesimal L-independence of . . . , or of 
z^'^\ are not properties that are of interest in themselves. 

However, with m = 2, one has the nontrivial result that the two components of 
the mixed parameter 6^^) are always orthogonal. This follows immediately 
from the formula 


(25) 


d^l 


= - z^^^y 


020 ( 1 ) 


In the two examples given below, and 0 ^^) are not only asymptotically 
independent, as implied by the orthogonality of and 0 ^^)^ t>ut are in fact exactly 
independent. 


Example 9.31. Let $x_1, \dots, x_n$ be independent and following the gamma distribution
with probability function

$$\frac{1}{\Gamma(\lambda)\,\beta^{\lambda}}\,x^{\lambda-1}\,e^{-x/\beta}.$$

The pair $(\mu, \lambda)$, where $\mu = \lambda\beta$ is the mean value of the gamma distribution,
constitutes a mixed parameter. By the above results, $\mu$ and $\lambda$ are therefore
orthogonal and infinitesimally L-independent.

To prove the stronger assertion that the maximum likelihood estimates $\hat\mu$ and $\hat\lambda$
are independent one may note that $\hat\mu$ and $\hat\lambda$ are determined by the equations

$\hat\mu = \bar x$

$\psi(\hat\lambda) - \ln\hat\lambda = \ln(\tilde x/\bar x)$

where $\psi$ denotes the digamma function and $\tilde x$ is the geometric mean of $x_1, \dots, x_n$.
Clearly, the distribution of $\tilde x/\bar x$ does not depend on $\beta$ and hence, by Basu's
Theorem (Corollary 4.4), $\bar x$ and $\tilde x/\bar x$ are independent, which implies the
independence of $\hat\mu$ and $\hat\lambda$. This property of the gamma distribution was observed by
Cox and Lewis (1966). ►

Example 9.32. For the family $N^-$ of inverse Gaussian distributions, having model
function as (21) of Section 8.1, the parameter $(\mu, \lambda)$ is mixed. Here


ft = X 



That $\hat\mu$ and $\hat\lambda$ are indeed not only orthogonal but independent was shown by
Tweedie (1957), who moreover established the remarkable result that $n\lambda/\hat\lambda$ follows
a $\chi^2$-distribution with $n - 1$ degrees of freedom. Thus the inverse Gaussian family
allows for an immediate analogue of the analysis of variance for nested
classifications of normal variates.

These results by Tweedie follow simply from formula (6) of Section 8.2. The 
conditional distribution of Z(l/x.) given x. is exponential and does not depend on 
a (cf. (22) of Section 8.1). To find this distribution it therefore suffices to determine 
the conditional Laplace transform 

where Eq denotes mean value with respect to the probability measure given by 

= 0, 2 = 1. Now, for a = 0 the distribution (22) of Section 8.1 is a stable 
distribution with characteristic exponent j and hence .x. is distributed according 
to (22) of Section 8.1 with a = 0 and 2 replaced by n^2. Consequently, by (6) of 
Section 8.2 and for 0 = -(2 — l)/2. 


or, equivalently, 

£o(e«"A-'|x) = (l 

Since the right hand side does not depend on $x_\cdot$, the estimates $\hat\mu$ and $\hat\lambda^{-1}$ must be
independent; furthermore, the right hand side is the Laplace transform of the $\chi^2$-
distribution with $n - 1$ degrees of freedom, as was to be shown. ►


Consider a family ip which is not necessarily exponential but yet partly 
exponential in the sense that its model function may be written as 

a{e^^W^)bix; 

i.e. for each fixed ^^< 2 ) is exponential with 9^^^ and as canonical quantities. 

Under smoothness assumptions, (25) generalizes to 


dH 






$\tau^{(1)}$ still being the mean value of $t^{(1)}$. The parameters $\tau^{(1)}$ and $\theta^{(2)}$ are consequently
orthogonal (and often infinitesimally L-independent).

Example 9.33. The family of negative binomial distributions, having model
function

$$(1-\pi)^{\chi}\binom{\chi+x-1}{x}\pi^{x},$$

is partly exponential and one sees that the mean $\mu = \chi\pi/(1-\pi)$ and the shape
parameter $\chi$ are orthogonal (as noted first by Anscombe (1950)). ►


(vii) Least squares and maximum likelihood. For an arbitrary vector variate $t$ with
mean value $\tau = \tau(\omega)$ and variance $V = V(\omega)$, where $\omega$ is a parameter ($\omega \in \Omega
\subset R^m$), the generalized weighted least squares estimate of $\omega$ is defined as the
solution to the equation

(26)   $(t - \tau)\,V^{-1}\,\dfrac{\partial\tau'}{\partial\omega} = 0$

(which, when $\tau$ depends linearly on $\omega$, $\Omega = R^m$ and $V$ is constant, equals that of the
classical weighted least squares procedure of minimizing the quadratic form
$(t-\tau)V^{-1}(t-\tau)'$).

Suppose, in addition, that the family of probability measures governing $t$ is
exponential with $t$ as minimal canonical statistic. Then the left hand side of (26)
equals the derivative of the log-likelihood function, so (26) is, in fact, also the
likelihood equation.

In this connection it may be noted that the iterative (generalized)
Gauss-Newton method for solving (26) is identical to Fisher's scoring method.

The above has been observed previously in somewhat less generality, see
Bradley (1973) and Wedderburn (1974).
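
The identity between the Gauss-Newton step for (26) and the scoring step can be seen in a small example. The sketch below assumes independent Poisson counts $t_i$ with means $\tau_i = e^{\omega x_i}$ and a scalar parameter $\omega$; the covariates, the counts and the function names are made up for illustration. In this model $V_i = \tau_i$ and $\partial\tau_i/\partial\omega = x_i\tau_i$, so the Gauss-Newton normal matrix $J'V^{-1}J$ equals the Fisher information $\sum x_i^2\tau_i$.

```python
from math import exp

def gauss_newton_fit(x, t, omega=0.0, steps=50):
    """Generalized Gauss-Newton iteration for equation (26).

    With V_i = tau_i and J_i = x_i * tau_i the step
    (J' V^-1 J)^-1 J' V^-1 (t - tau) is also the Fisher scoring step.
    """
    for _ in range(steps):
        tau = [exp(omega * xi) for xi in x]
        jac = [xi * mi for xi, mi in zip(x, tau)]          # d tau_i / d omega
        normal = sum(ji * ji / mi for ji, mi in zip(jac, tau))
        rhs = sum(ji * (ti - mi) / mi for ji, ti, mi in zip(jac, t, tau))
        omega += rhs / normal
    return omega

def left_side_of_26(omega, x, t):
    """Left-hand side of (26); for this model it is the score sum((t - tau) x)."""
    total = 0.0
    for xi, ti in zip(x, t):
        tau = exp(omega * xi)
        total += (ti - tau) / tau * (xi * tau)
    return total

x = [0.0, 1.0, 2.0, 3.0]      # assumed covariate values
t = [1, 3, 4, 9]              # assumed observed counts
omega_hat = gauss_newton_fit(x, t)
```

At the returned value the left-hand side of (26) vanishes, and since it coincides with the score the same value solves the likelihood equation.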


(viii) Suppose $\mathfrak{P}$ is regular with minimal standard representation

$$\frac{dP_\theta}{dP_0} = e^{\theta\cdot t - \kappa(\theta)}.$$

If $\theta$ tends to the finite or infinite boundary of $\Theta$, the probability mass of $P_\theta$ is swept
towards the corresponding, i.e. the dual, part of the boundary of $C$.

One of the possible precise versions of this statement is as follows.

For every unit vector $e$ in $R^k$ and every $d < \delta^*(e\,|\,C)$ one has

$P_{\lambda e}\{e\cdot t < d\} \to 0$ for $\lambda \uparrow \lambda(e)$

where $\lambda$ denotes a scalar variable and

$\lambda(e) = \sup\{\lambda: \lambda e \in \Theta\}$.

In fact, the stronger result

$$\int_{\{e\cdot t<d\}} e^{w(d - e\cdot t)}\,dP_{\lambda e} \to 0 \quad\text{for } \lambda \uparrow \lambda(e)$$

holds for every $w < \lambda(e)$. To see this note that for $\lambda > w$

$$\int_{\{e\cdot t<d\}} e^{w(d - e\cdot t)}\,dP_{\lambda e} \le e^{\lambda d - \kappa(\lambda e)}\,P_0\{e\cdot t < d\}.$$

That the right hand side tends to zero for $\lambda \uparrow \lambda(e)$ follows from formula (5) of
Section 7.1 if $\lambda(e) = \infty$ and from the first part of Theorem 7.1 otherwise.

(ix) Theorem 9.19 is valid for arbitrary convex $D$. If $D$ is a closed convex cone,
the criterion of the theorem may be sharpened as follows.


Theorem 9.32. Suppose that $\kappa$ is steep and that $D$ is a closed convex cone. Let $D^+$
denote the negative of the polar of $D$ and let $t \in R^k$, $\theta \in R^k$.

Then

(27)   $t \in \operatorname{dom}\hat\theta_0$,   $\theta = \hat\theta_0(t)$

if and only if

(28i)   $\theta \in D \cap \operatorname{int}\Theta$

(28ii)   $\tau(\theta) - t \in D^+$

(28iii)   $\theta\cdot(\tau(\theta) - t) = 0$.

Proof Note that since k is steep 


(29) 


range Bq Dn int 0. 


Let / denote the negative of the log-likelihood function I for ^ corresponding 
to the observation t i.e. 


f(e) = -^1(9; t) = m - 6- 1 , 
/ is a closed convex function, f* = t(t + ‘) and 


deRK 


Hence 


int (dom /) n (int D) = (int ©) n (int D) # 0. 


inf{/(0): BeD} = -inf{/*(r*):t*6D*} 

where the infimum at the right hand side is attained, see Theorem 5.35. 

Now, suppose t e dom and B = 0o(4 and let t* be a point in R^ such that /* 
attains its infimum over D* at t*. Then 

/(0) = inf/= -inf f* = 

D D* 

and using Theorem 5.35 one obtains

(30)   $t^* \in \partial f(\theta)$,   $\theta \in D$,   $t^* \in D^+$,   $\theta\cdot t^* = 0$.

By (29),

(31)   $\theta \in D \cap \operatorname{int}\Theta$.

Hence $\theta \in \operatorname{int}\Theta$ and $\partial f(\theta) = \tau(\theta) - t$, i.e.

(32)   $t^* = \tau(\theta) - t$.

Formulas (31) and (32) and the third and fourth assertion in (30) yield (28i-iii).

Conversely, if (28i-iii) are satisfied then so is (30) with $t^* = \tau(\theta) - t$. Invoking
again Theorem 5.35 one finds that $f(\theta) = \inf_D f$, which implies (27). ►

(Conditions (28i-iii) are, in essence, the so-called Kuhn-Tucker conditions for
the mathematical programme

minimize $f(\theta) = \kappa(\theta) - \theta\cdot t$ subject to $\theta \in D$,

cf. Rockafellar (1970).)

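Corollary 9.11. Suppose $\mathfrak{P}$ is regular and that $D$ is of the form

$D = \{\theta: \theta_1 \ge 0, \dots, \theta_j \ge 0\}$

where $1 \le j \le k$.

Then

(33)   $t \in \operatorname{dom}\hat\theta_0$,   $\theta = \hat\theta_0(t)$

if and only if $\theta \in \Theta$ and, with $\tau = \tau(\theta)$,

(34i)   for $i = 1, \dots, j$,
           either $\theta_i = 0$ and $\tau_i \ge t_i$
           or $\theta_i > 0$ and $\tau_i = t_i$,

(34ii)   for $i = j + 1, \dots, k$,
           $\tau_i = t_i$.

Proof. The assumptions of Theorem 9.32 are fulfilled and
$D^+ = \{\theta: \theta_1 \ge 0, \dots, \theta_j \ge 0,\ \theta_{j+1} = 0, \dots, \theta_k = 0\}$. Thus (33) is equivalent to

(35)   $\theta \in \Theta$,   $\theta_1 \ge 0, \dots, \theta_j \ge 0$,

(36)   $\tau_1 \ge t_1, \dots, \tau_j \ge t_j$,   $\tau_{j+1} = t_{j+1}, \dots, \tau_k = t_k$,

(37)   $\theta_1(\tau_1 - t_1) + \dots + \theta_k(\tau_k - t_k) = 0$.

Clearly, (35)-(37) are equivalent to $\theta \in \Theta$ and (34i-ii). ►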
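
Conditions (34i-ii) are easy to check mechanically in a separable model. The sketch below assumes $k$ independent Poisson counts with canonical parameters $\theta_i = \ln(\text{mean}_i)$, so that $\kappa(\theta) = \sum e^{\theta_i}$ and $\tau_i = e^{\theta_i}$, and constrains the first $j$ coordinates to be non-negative. The model choice, the data and the function names are assumptions for illustration, and indices in the code start at 0.

```python
from math import exp, log

def constrained_mle(t, j):
    """Maximize sum(theta_i * t_i - exp(theta_i)) with theta_i >= 0 for i < j.

    The problem separates by coordinate: a constrained coordinate equals
    log(t_i) when t_i > 1 and sits on the boundary 0 otherwise.
    Unconstrained coordinates need t_i > 0.
    """
    theta = []
    for i, ti in enumerate(t):
        if i < j:
            theta.append(log(ti) if ti > 1 else 0.0)
        else:
            theta.append(log(ti))
    return theta

def satisfies_conditions_34(theta, t, j, tol=1e-9):
    """Check (34i) for the first j coordinates and (34ii) for the rest."""
    for i, (th, ti) in enumerate(zip(theta, t)):
        tau_i = exp(th)
        if i < j:
            ok = (th == 0 and tau_i >= ti - tol) or (th > 0 and abs(tau_i - ti) <= tol)
        else:
            ok = abs(tau_i - ti) <= tol
        if not ok:
            return False
    return True

t_obs = [0, 4, 1, 3]          # assumed counts; the first two coordinates are constrained
theta_hat = constrained_mle(t_obs, j=2)
```

The zero count in a constrained coordinate shows the boundary case of (34i): the estimate is $\theta_1 = 0$ with $\tau_1 = 1 \ge t_1 = 0$, although the unconstrained likelihood has no maximum there.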

(x) The following comment is incidental to the remarks preceding Theorem 9.24.
It is simple to see that if $\varphi$ is closed convex and $k = 1$ then range $\check t$ is convex (i.e. an
interval). This, however, need not be so for $k > 1$.

Example 9.34. Let $k = 2$ and let $\varphi^*$ be the closed convex function on $R^2$
determined by

$$\varphi^*(\omega) = \frac{\omega_1^2}{4(\omega_2 + 1)} + c_0, \qquad \omega = (\omega_1, \omega_2),\ \omega_2 > -1,$$

where $c_0$ is a constant to be chosen later. Set $\varphi = \varphi^{**}$. Then $\varphi^*$ is the conjugate
of $\varphi$.

$\varphi^*$ is not affine along any line and hence aff dom $\varphi$ = $R^2$, cf. Rockafellar (1970),
Theorem 13.4. Moreover, $0 \in \operatorname{int}\operatorname{dom}\varphi^*$. Thus, by Theorem 6.1, the integral

$$\int e^{-\varphi(t)}\,dt$$

is finite. Let $c_0$ be determined so that the value of the integral is 1, which is clearly
possible.

With this $\varphi$ one has

range $\check t$ = $\partial\varphi^*(\operatorname{int}\operatorname{dom}\varphi^*)$
        = $D\varphi^*\{\omega: \omega_2 > -1\}$
        = $\{t: t_2 = -t_1^2\}$.

Thus range $\check t$ is a parabola and is not convex. ►

(xi) To some extent the roles played by the mean value mapping $\tau$ and the mode
mapping $\check t$ in, respectively, maximum likelihood and maximum plausibility
estimation are analogous. Thus $\hat\theta^{-1} = \tau$ provided $\mathfrak{P}$ is regular, while $\check\theta^{-1} = \check t$
provided $\mathfrak{P}$ is universal. Moreover, with the same assumptions on $\mathfrak{P}$, $(\tau^{(1)}, \theta^{(2)})$
parametrizes $\mathfrak{P}$ and $\tau^{(1)}$ and $\theta^{(2)}$ are variation independent, while the same
properties of $(\check t^{(1)}, \theta^{(2)})$ occur in important cases, cf. Theorem 9.27 and the remark
following Example 9.21.


(xii) Factorial series families. Consider a (full) factorial series family Q of one- 
dimensional distributions, 

cf. Section 8.3(ii), and suppose $S$ ($= \{b(\cdot) > 0\}$) is a set of consecutive integers.
If

$$h(x) = x + b(x)/b(x + 1)$$

is a nondecreasing function on $S$ then $\mathfrak{Q}$ is unimodal.

The family $\mathfrak{Q}$ is strongly unimodal if and only if $b$ is strongly unimodal, i.e.

$$b(x - 1)\,b(x + 1) \le b(x)^2, \qquad x \in N,$$

(which, in turn, is equivalent to $b(x)/b(x + 1)$ being nondecreasing on $S$) and in
this case $\mathfrak{Q}$ is universal.

Furthermore, strong unimodality of $\mathfrak{Q}$ implies that the maximum plausibility
estimate of $\eta$ exists for every $x \in S$ and is given by

$$\check\eta(x) = \{\eta \in N_0: h(x - 1) \le \eta \le h(x)\}.$$

Example 9.35. For the family of hypergeometric distributions

(38)   $\dbinom{\eta}{x}\dbinom{m}{n-x}\Big/\dbinom{\eta+m}{n}$

one has

$$b(x)/b(x + 1) = (x + 1)\left(\frac{m+1}{n-x} - 1\right)$$

and hence $\check\eta$ is determined by

$$(m + 1)\,\frac{x}{n-x+1} - 1 \le \check\eta \le (m + 1)\,\frac{x+1}{n-x} - 1.$$

The maximum likelihood estimate is obtained from

$$m\,\frac{x}{n-x} - 1 \le \hat\eta \le m\,\frac{x}{n-x}.$$

(If $m$ members of a population of size $m + \eta$ are marked, and a random sample
of $n$ is taken from the population then (38) is the probability that the sample
contains $x$ unmarked members. The estimate $m + \hat\eta$ of the total population size,
or its approximation $mn/(n - x)$, is known in the literature on capture-recapture
investigations as the Petersen estimate, after Petersen (1896).) ►
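
The two sets of inequalities in Example 9.35 can be checked against a brute-force search. The sketch below assumes the reading of the hypergeometric probability as $\binom{\eta}{x}\binom{m}{n-x}/\binom{\eta+m}{n}$, uses exact rational arithmetic so that ties are detected reliably, and searches $\eta$ over a finite range; the numerical values of $m$, $n$, $x$, the search limit and the function names are arbitrary choices for illustration.

```python
from fractions import Fraction
from math import comb

def probability(x, eta, m, n):
    """Exact probability of x unmarked members in a sample of n."""
    return Fraction(comb(eta, x) * comb(m, n - x), comb(eta + m, n))

def ml_estimates(x, m, n, eta_max=500):
    """All eta in the search range that maximize the likelihood of x."""
    candidates = range(max(x, n - m), eta_max + 1)
    values = {eta: probability(x, eta, m, n) for eta in candidates}
    best = max(values.values())
    return [eta for eta, value in values.items() if value == best]

def plausibility_estimates(x, m, n, eta_max=500):
    """All eta in the search range for which x is a mode of the distribution,
    i.e. for which the normed probability of x equals 1."""
    result = []
    for eta in range(max(x, n - m), eta_max + 1):
        modal = max(probability(y, eta, m, n) for y in range(n + 1))
        if probability(x, eta, m, n) == modal:
            result.append(eta)
    return result

m, n, x = 50, 30, 20          # assumed: 50 marked, sample of 30, 20 unmarked
eta_ml = ml_estimates(x, m, n)
eta_plaus = plausibility_estimates(x, m, n)
petersen = m * n / (n - x)    # approximate total population size
```

For these values $mx/(n-x) = 100$, so the likelihood is maximized at both 99 and 100, while the plausibility bounds are roughly 91.7 and 106.1, giving the integers 92 to 106.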

(xiii) Discrimination information. The discrimination information between $\tilde P$ and $P$,
where $\tilde P$ and $P$ are any two probability measures on a sample space $\mathfrak{X}$, is the
quantity $I(\tilde P, P)$ defined by

$$I(\tilde P, P) = \int \ln\frac{d\tilde P}{dP}\,d\tilde P;$$

this quantity is also known as the information of $\tilde P$ with respect to $P$, cf. Savage
(1950) and Kullback (1959). By Jensen's inequality (Section 5.5(m)), $I(\tilde P, P) \ge 0$
with equality if and only if $\tilde P = P$, and $I(\tilde P, P)$ may be thought of as a directed
measure of the dissimilarity or distance between $\tilde P$ and $P$. Furthermore, for a
parametrized family $\mathfrak{P} = \{P_\omega: \omega \in \Omega\}$ of probability measures, determined by a
family $\{p(\cdot\,; \omega): \omega \in \Omega\}$ of probability functions, one has, under mild smoothness
assumptions, that for fixed $\tilde\omega$ the matrix of second order partial derivatives of
$I(\tilde\omega, \omega)$, with respect to $\omega \in \Omega$ and evaluated at $\omega = \tilde\omega$, is equal to Fisher's
information matrix $i(\tilde\omega)$.


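For an exponential family $\mathfrak{P}$ with minimal representation

$$\frac{dP_\theta}{d\mu}(x) = a(\theta)\,b(x)\,e^{\theta\cdot t(x)}$$

the discrimination information between $P_{\tilde\theta}$ and $P_\theta$ may, if $\tilde\theta \in \operatorname{int}\Theta$, be written

$$I(\tilde\theta, \theta) = (\tilde\theta - \theta)\cdot\tilde\tau - \kappa(\tilde\theta) + \kappa(\theta)$$

where $\tilde\tau = \tau(\tilde\theta) = E_{\tilde\theta}\,t$. Suppose $\mathfrak{P}$ is full, let $t \in \mathfrak{T}$ ($= \tau(\operatorname{int}\Theta)$) and let $\hat\theta$ be the
maximum likelihood estimate of $\theta$ based on $t$. Then, in the notations of Section
9.3,

$$I(\hat\theta, \theta) = \hat\theta\cdot t - \kappa(\hat\theta) - (\theta\cdot t - \kappa(\theta)) = \hat l(t) - l(\theta; t),$$

which is a convex function of $\theta$. Furthermore, if $\mathfrak{P}_0 = \{P_\theta: \theta \in \Theta_0\}$ is any
subfamily of $\mathfrak{P}$, the maximum likelihood estimate $\hat\theta_0$ of $\theta$ under the hypothesis $\mathfrak{P}_0$
can be obtained by minimizing $I(\hat\theta, \theta)$ over $\theta \in \Theta_0$, and $I(\hat\theta, \hat\theta_0) = -\ln q$ where $q$
denotes the likelihood ratio test statistic of the hypothesis $\mathfrak{P}_0$. It follows that if $\mathfrak{P}_0$
is affine and if $\mathfrak{P}_{00}$ is an arbitrary subfamily of $\mathfrak{P}_0$ then

$$I(\hat\theta, \hat\theta_{00}) = I(\hat\theta, \hat\theta_0) + I(\hat\theta_0, \hat\theta_{00}),$$

$\hat\theta_{00}$ being the maximum likelihood estimate under $\mathfrak{P}_{00}$.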
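
The closed form for the discrimination information in an exponential family can be compared with a direct evaluation of $\int \ln(d\tilde P/dP)\,d\tilde P$. The sketch below assumes the Poisson family with $\kappa(\theta) = e^\theta$ and $\tau(\theta) = e^\theta$ as a test case; the truncation of the series and the function names are illustrative choices, not part of the text.

```python
from math import exp, log, lgamma

def kappa(theta):
    """Cumulant function of the Poisson family in the canonical parameter."""
    return exp(theta)

def tau(theta):
    """Mean value mapping of the Poisson family."""
    return exp(theta)

def discrimination(theta_tilde, theta):
    """Closed form (theta~ - theta) * tau~ - kappa(theta~) + kappa(theta)."""
    return (theta_tilde - theta) * tau(theta_tilde) - kappa(theta_tilde) + kappa(theta)

def discrimination_by_summation(theta_tilde, theta, terms=200):
    """Direct evaluation of sum p~(x) * ln(p~(x) / p(x)) over x = 0, 1, ..."""
    a, b = exp(theta_tilde), exp(theta)
    total = 0.0
    for x in range(terms):
        log_pa = x * log(a) - a - lgamma(x + 1)
        log_pb = x * log(b) - b - lgamma(x + 1)
        total += exp(log_pa) * (log_pa - log_pb)
    return total

theta_tilde, theta = log(3.0), log(5.0)
closed = discrimination(theta_tilde, theta)
summed = discrimination_by_summation(theta_tilde, theta)

# With t = tau(theta_tilde) the same number is l_hat(t) - l(theta; t).
t = tau(theta_tilde)
difference = (t * log(t) - t) - (theta * t - kappa(theta))
```

The three quantities agree up to rounding, which illustrates both the closed form and the identity $I(\hat\theta, \theta) = \hat l(t) - l(\theta; t)$ in this family.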

9.9 NOTES 

Except for Theorem 9.6 which is new and Theorem 9.7 which was given in
Barndorff-Nielsen (1973b), the results in Sections 9.1, 9.3, and 9.4 are from
Barndorff-Nielsen (1970). There is some overlap between Sections 9.1 and 9.3 on
the one hand and Chapter 4 of the independent work by Chentsov (1972) on the
other. Theorem 9.11 is due to Bildikar and Patil (1968). Sections 9.5 and 9.6 are
based on material in Barndorff-Nielsen (1973b, 1976b), and Section 9.7 is based
on Barndorff-Nielsen (1977b) and Mathiasen (1977).

For asymptotic properties of exponential families, in particular concerning
maximum likelihood estimates and maximum plausibility estimates, the reader is
referred to Efron and Truax (1968), Andersen (1969), Berk (1972), Martin-Löf
(1970), Höglund (1974), and Mathiasen (1977). The paper by Höglund is of special
interest in the context of the present treatise as it shows, roughly speaking, that
the distance between $\hat\theta$ and $\check\theta$ will ordinarily be of the order of $n^{-1}$ where $n$ is the
sample size; thus the distance is infinitesimal compared to the standard deviation
of $\hat\theta$ (see also Section 10.7).



CHAPTER 10 


Inferential Separation and 
Exponential Families 


The questions of the existence and character of the ancillary and sufficient 
statistics, including the cuts, under a given statistical model can to a large extent 
be settled by general theorems, provided the model is exponential. The present 
chapter demonstrates this. 

The same general assumptions on the exponential families and their exponen- 
tial representations as were made for the previous chapter (see p. 139) will be 
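presupposed here. Moreover, $\theta = (\theta^{(1)}, \theta^{(2)})$, $t = (t^{(1)}, t^{(2)})$, and $\tau = (\tau^{(1)}, \tau^{(2)})$ stand for similar
partitions of $\theta$, $t$, and $\tau$, into components of dimensions $k^{(i)}$, $i = 1, 2$, and $\mathfrak{P}$ is,
except in Section 10.1, assumed to have an open kernel.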


10.1 QUASI-ANCILLARITY AND EXPONENTIAL FAMILIES 

The concept of quasi-ancillarity, which was introduced in Section 4.5 and of
which S- and M-ancillarity are special cases, is sufficiently strong to allow useful
conclusions on existence and character of quasi-ancillary statistics to be drawn,
when the family of distributions is exponential. Of the results given here the most
important are Corollary 10.3, which states that if $\mathfrak{P}$ has open kernel then the
quasi-ancillary statistics are simply the statistics of the form $t^{(1)}$, and Theorem
10.2, which, in particular, contains a condition for maximal quasi-ancillarity.

A parameter function $\psi$, viewed as a function on $\Theta$, often depends only on
some, $r$ say, of the coordinates of $\theta$. The number $r$ will, in general, vary with the
minimal canonical parametrization and the smallest possible $r$ will be called the
rank of $\psi$. A convenient, more formal definition of rank is the following. If $f$ is a
mapping on a subset $Y$ of $R^k$ then the rank of $f$ is defined as the smallest integer
$m \in \{0, 1, \dots, k\}$ for which there exists an affine mapping $g$ on $R^k$ into $R^m$ and a
mapping $\tilde f$ from $R^m$ such that

$f(y) = \tilde f(g(y)), \quad y \in Y$.

Now, recall the definition of quasi-ancillary statistics and note that if $\mathfrak{P}$ has
open kernel then every quasi-ancillary, as is simple to show.

Let $u$ be quasi-ancillary. The size of $u$ is defined as $\max \operatorname{ord}(\mathfrak{P}_0)$ where the
maximum is taken over all members $\mathfrak{P}_0$ of the partition of $\mathfrak{P}$ induced by the
conditional distributions given $u$.

Suppose $u$ is quasi-ancillary with respect to $\psi$. If a component $t^{(1)}$ is a function
of $u$ then $k^{(1)} + \operatorname{rank}\psi \le k$, because $\psi$ depends on $\theta$ only through $\theta^{(2)}$,
which has rank less than or equal to $k - k^{(1)}$. It follows, in particular, that

(1)  $\operatorname{size} u + \operatorname{rank}\psi \le k$.

To see this, let $\mathfrak{P}_0$ be as above and such that $\operatorname{ord}\mathfrak{P}_0 = \operatorname{size} u$, and let $t_0$ be a
minimal canonical statistic for $\mathfrak{P}_0$. Since $u$ is sufficient for $\mathfrak{P}_0$ the statistic $t_0$ is a
function of $u$ and the previous inequality applies.

Theorem 10.1. Let $u$ be a statistic, let $\psi$ be a parameter function and suppose that $u$
is quasi-ancillary with respect to $\psi$ and that $\psi$ parametrizes the conditional
distributions given $u$. Let $d$ and $r$ denote the size of $u$ and the rank of $\psi$, respectively.

Then B-sufficiency of $u$ is equivalent to $d = k$ and also to $r = 0$, while B-ancillarity
of $u$ is equivalent to $d = 0$ and also to $r = k$. If neither of these two cases
occurs then there exists a couple $t = (t^{(1)}, t^{(2)})$, $\theta = (\theta^{(1)}, \theta^{(2)})$ such that

$d \le k^{(1)} \le k - r$ and

(i) $t^{(1)}$ is a function of $u$ and $u$ is conditionally B-ancillary given $t^{(1)}$.

(ii) $\psi$ stands in one-to-one correspondence with $\theta^{(2)}$.

Proof. Clearly, B-sufficiency implies $d = k$ and is implied by $r = 0$. Moreover, B-ancillarity
implies $r = k$ and is, on account of the quasi-ancillarity of $u$, implied by
$d = 0$. In view of (1) the first part of the theorem is thus established.

Hence, suppose $0 < d \le k - r < k$ and let $t^{(1)}$ be such that $t^{(1)}$ is a function of $u$
and such that there exists no other component $\tilde t^{(1)}$ for which $\tilde t^{(1)}$ is a function of $u$
and $\tilde k^{(1)} > k^{(1)}$.

By the remark preceding Theorem 10.1, one has $k^{(1)} \le k - r$, and the inequality
$d \le k^{(1)}$ will be established at the end of the proof.

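The conditional distribution given $t^{(1)}$ has a density which may be written in the
following two ways

(2)  $a(\theta^{(2)}; t^{(1)})\, e^{\theta^{(2)} \cdot t^{(2)}} = p(u; \theta^{(2)} \mid t^{(1)})\, p(x; \psi \mid u)$,

and $\psi$ depends on $\theta^{(2)}$ only. Let $x$, $\tilde x$ and $\theta^{(2)}$, $\tilde\theta^{(2)}$ be any pairs satisfying
$u(x) = u(\tilde x) = u$ and $\psi(\theta^{(2)}) = \psi(\tilde\theta^{(2)}) = \psi$. For the values, $t$ and $\tilde t$, corresponding
to $x$ and $\tilde x$, one has $t^{(1)} = \tilde t^{(1)}$ and hence one obtains from (2)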

$\dfrac{p(x; \psi \mid u)}{p(\tilde x; \psi \mid u)} = e^{\theta^{(2)} \cdot (t^{(2)} - \tilde t^{(2)})} = e^{\tilde\theta^{(2)} \cdot (t^{(2)} - \tilde t^{(2)})}$

whence

$(\theta^{(2)} - \tilde\theta^{(2)}) \cdot t^{(2)} = (\theta^{(2)} - \tilde\theta^{(2)}) \cdot \tilde t^{(2)}$.

If $\theta^{(2)} \neq \tilde\theta^{(2)}$ this would be in contradiction to the assumed 'maximality' of $t^{(1)}$ and
therefore $\psi$ is a one-to-one function of $\theta^{(2)}$. But this, in turn, means that the
conditional distribution given $t^{(1)}$ depends on $\psi$ only so that, by the quasi-ancillarity
of $u$, this latter statistic is conditionally B-ancillary given $t^{(1)}$. Thus (i)
and (ii) have been proved, and it is now simple to see that $d \le k^{(1)}$. ►

Corollary 10.1. If $k = 1$ then any quasi-ancillary statistic is either B-ancillary
or B-sufficient.

Corollary 10.2. If $\mathfrak{P}$ has open kernel then $t^{(1)}$ is in one-to-one correspondence with
$u$ and $d = k^{(1)} = k - r$.

Corollary 10.3. If $\mathfrak{P}$ has open kernel then the class of quasi-ancillary statistics
equals the class of components $t^{(1)}$. ►

To any parameter function $\psi$ of rank $r$, for which $0 < r < k$, there exist minimal
canonical variates $t = (t^{(1)}, t^{(2)})$ and $\theta = (\theta^{(1)}, \theta^{(2)})$ such that $k^{(1)} = k - r$ and $\psi$
depends on $\theta^{(2)}$ only. Under mild conditions, any statistic which is quasi-ancillary
with respect to $\psi$ is an affine function of the $t^{(1)}$ thus chosen, cf. Theorem 10.2
below. However, $t^{(1)}$ itself may not be quasi-ancillary.

Example 10.1. Let $x$ be normally distributed with $(\xi, \sigma^2)$ unknown and set $\psi = \sigma^2$
and $t = (x, x^2/2)$, $\theta = (\xi/\sigma^2, -1/\sigma^2)$. Here $t^{(1)} = x$ is not quasi-ancillary with
respect to $\psi$ because $\psi$ is not a function of the conditioning mapping $\theta \to P_\theta(\cdot \mid t^{(1)})$
which in this case is constant, $P_\theta(\cdot \mid t^{(1)})$ assigning probability 1 to a single point. ►

Essentially, the difficulty, exposed by this example, is that $\theta^{(2)}$ need not
parametrize the conditional distributions given the statistic $t^{(1)}$. This, however,
can happen only if the conditional distribution of $t^{(2)}$ given $t^{(1)}$ is singular, cf.
Lemma 8.4.

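Lemma 10.1. Suppose $\mathfrak{P}$ is open and convex. Let $\psi$ be a parameter function of rank $r$
and let $t = (t^{(1)}, t^{(2)})$ and $\tilde t = (\tilde t^{(1)}, \tilde t^{(2)})$ be two minimal canonical statistics such
that the dimensions of $t^{(1)}$ and $\tilde t^{(1)}$ are, respectively, $k^{(1)}$ and $k - r$. Suppose that $\psi$
considered as a function of $\theta = (\theta^{(1)}, \theta^{(2)})$ depends on $\theta^{(2)}$ only, and considered as a
function of $\tilde\theta = (\tilde\theta^{(1)}, \tilde\theta^{(2)})$ depends on $\tilde\theta^{(2)}$ only.

Then $k^{(1)} \le k - r$, $t^{(1)}$ is an affine transformation of $\tilde t^{(1)}$ and $\tilde\theta^{(2)}$ is an affine
transformation of $\theta^{(2)}$. Furthermore, if $k^{(1)} = k - r$ then these transformations are
regular.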

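Proof. According to Lemma 8.1 there exist two constant $k \times k$ matrices $A$ and $\tilde A$
and two constant $1 \times k$ vectors $B$ and $\tilde B$ such that

$\tilde t = tA + B$
$\tilde\theta = \theta\tilde A + \tilde B$

where

(3)  $\tilde A A' = I_k$.

Let

$\tilde A = \begin{pmatrix} \tilde A_{11} & \tilde A_{12} \\ \tilde A_{21} & \tilde A_{22} \end{pmatrix}$

be the partition of $\tilde A$ such that $\tilde A_{11}$ is a $k^{(1)} \times (k - r)$ matrix. Since

(4)  $\tilde\theta^{(2)} = \theta^{(1)}\tilde A_{12} + \theta^{(2)}\tilde A_{22} + \tilde B^{(2)}$

it suffices, in view of (3), to prove that $\tilde A_{12} = 0$.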

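Let the vector $(0, \dots, 0, 1, 0, \dots, 0)$ in $R^k$ with 1 as the $i$th coordinate be denoted
by $e_i$ and let $E = \operatorname{span}\{e_1, \dots, e_{k^{(1)}}\}$, the subspace of $R^k$ spanned by $\{e_1, \dots, e_{k^{(1)}}\}$.
Furthermore, let

$\tilde E = \{\theta : \theta^{(1)}\tilde A_{12} + \theta^{(2)}\tilde A_{22} = 0\}$.

Clearly $\tilde E$ is a subspace of dimension $k - r$ and if $\theta$ and $\theta_0$ are elements in $\Theta$ such
that $\theta - \theta_0 \in \tilde E$ then, by (4), $\tilde\theta^{(2)}(\theta) = \tilde\theta^{(2)}(\theta_0)$.

Assume $\tilde A_{12} \neq 0$ and note that this implies that $E$ is not a subspace of $\tilde E$. If

$F = \operatorname{span}(E, \tilde E)$

one can conclude that $m = \dim F - \dim\tilde E \ge 1$. Let $\theta_0 \in \Theta$ and let

$\Theta_0 = (\theta_0 + F) \cap \Theta$.

By assumption $\Theta$ is open and convex and hence it is possible to connect every
point in $\Theta_0$ with $\theta_0$ using a finite number of line segments contained in $\Theta_0$ and of
one of the two forms

$\alpha\theta + (1 - \alpha)\theta'$, where $\theta - \theta' \in E$
and

$\alpha\theta + (1 - \alpha)\theta''$, where $\theta - \theta'' \in \tilde E$.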

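Since

$\theta - \theta' \in E \Leftrightarrow \theta^{(2)} = \theta'^{(2)}$
and

$\theta - \theta'' \in \tilde E \Leftrightarrow \tilde\theta^{(2)}(\theta) = \tilde\theta^{(2)}(\theta'')$

one has

$\psi(\theta) = \psi(\theta')$

respectively

$\psi(\theta) = \psi(\theta'')$.

It follows that

$\psi(\theta) = \psi(\theta_0)$ for all $\theta \in \Theta_0$.

Consequently

$\operatorname{rank}\psi \le k - \dim \operatorname{aff}\Theta_0 = r - m$

in contradiction to the fact that $\operatorname{rank}\psi = r$. ►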

It may be noticed that if $r - m = 0$ then the assumption that $\mathfrak{P}$ is convex can be
replaced by the weaker assumption that $\mathfrak{P}$ is connected. Hence, if $\mathfrak{P}$ is open and
connected and $r = 1$ the conclusions in Lemma 10.1 hold.

From the discussion in connection with Example 10.1, and from Corollary 10.3 
and Lemmas 10.1 and 8.3, one obtains: 

Theorem 10.2. Let $\psi$ be a parameter function of rank $r$, where $0 < r < k$, and let
$t = (t^{(1)}, t^{(2)})$, $\theta = (\theta^{(1)}, \theta^{(2)})$ be such that $k^{(1)} = k - r$ and $\psi$ depends on $\theta^{(2)}$ only.

If $\theta^{(2)}$ parametrizes the conditional distributions of $t^{(2)}$ given $t^{(1)}$ (which is the
case, in particular, if the conditional distributions are non-singular) then $t^{(1)}$
is quasi-ancillary with respect to $\psi$.

Suppose either (a) $\mathfrak{P}$ has open kernel and $\psi$ is one-to-one as a function of $\theta^{(2)}$, or
(b) $\mathfrak{P}$ is open and convex. Then any statistic, which is quasi-ancillary with respect to
$\psi$, is (equivalent to) an affine function of $t^{(1)}$. Hence, if $t^{(1)}$ itself is quasi-ancillary
with respect to $\psi$ then it is maximal quasi-ancillary. In the opposite case there exists
no statistic, quasi-ancillary with respect to $\psi$, whose size equals $k - r$.

Note that if $r = 1$ then the assumption in (b) that $\mathfrak{P}$ is convex may be weakened
to $\mathfrak{P}$ being connected, cf. the remark immediately after Lemma 10.1.

Example 10.2. Consider the regular family $\mathfrak{P}$ of trinomial distributions with
model function

$\dfrac{n!}{x_1!\, x_2!\, (n - x_1 - x_2)!}\, p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{n - x_1 - x_2}$


and let the interest parameter be

$\psi = \dfrac{p_1}{p_1 + p_2}$.

Taking $t^{(1)} = x_1 + x_2$ and $\theta^{(2)} = \ln(p_2/p_1)$ one sees that $x_1 + x_2$ is quasi-ancillary
with respect to $\psi$ and is the only statistic having this property. (The
statistic $x_1 + x_2$ is, in fact, both S- and M-ancillary.)
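
The claim of Example 10.2 can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes the interest parameter is read as $p_1/(p_1 + p_2)$, and the function names are hypothetical. It computes the conditional distribution of $x_1$ given $x_1 + x_2$ directly from the trinomial model function and compares it with a binomial distribution whose probability parameter is $\psi$.

```python
from math import comb, factorial, isclose

def trinomial_pmf(x1, x2, n, p1, p2):
    """Trinomial model function of Example 10.2."""
    coef = factorial(n) // (factorial(x1) * factorial(x2) * factorial(n - x1 - x2))
    return coef * p1**x1 * p2**x2 * (1 - p1 - p2)**(n - x1 - x2)

def conditional_x1_given_sum(x1, s, n, p1, p2):
    """P(x1 | x1 + x2 = s), computed directly from the joint probabilities."""
    joint = trinomial_pmf(x1, s - x1, n, p1, p2)
    marginal = sum(trinomial_pmf(j, s - j, n, p1, p2) for j in range(s + 1))
    return joint / marginal

def binomial_pmf(x, m, q):
    return comb(m, x) * q**x * (1 - q)**(m - x)

# Two parameter points with the same psi = p1 / (p1 + p2) = 0.4 (assumed reading of psi).
n, s = 6, 4
for p1, p2 in [(0.2, 0.3), (0.1, 0.15)]:
    psi = p1 / (p1 + p2)
    for x1 in range(s + 1):
        lhs = conditional_x1_given_sum(x1, s, n, p1, p2)
        assert isclose(lhs, binomial_pmf(x1, s, psi))
```

Under these assumptions the conditional distribution depends on $(p_1, p_2)$ only through $\psi$, which is the behaviour one expects when $x_1 + x_2$ is ancillary for $\psi$ in the sense discussed above.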

Example 10.3. Let Xi and X 2 be independent, Poisson distributed with mean 
values / j and ^ 2 , and assume that A 2 ^ .The statistic x^ is the unique statistic 
quasi-ancillary with respect to 1.. (However, x. is not ancillary with respect to 
^ 2 , cf. Example 4.17). >
Example 10.4. Again, let $x_1$ and $x_2$ be independent Poisson variates, but suppose
now that $\lambda = (\lambda_1, \lambda_2)$ varies in $(0, \infty) \times (0, \infty)$ and let



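$\psi = \begin{cases} 0 & \text{if } \lambda_1 = \lambda_2 \\ 1 & \text{if } \lambda_1 \neq \lambda_2, \end{cases}$

i.e. $\psi$ indicates the hypothesis that $\lambda_1 = \lambda_2$. Then $x_\cdot$ is the only statistic quasi-ancillary
with respect to $\psi$. (In fact, $x_\cdot$ is both S- and M-ancillary.) ►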


Theorem 10.2 will be further drawn upon and exemplified in Sections 10.4 and 
10.5. 

In order to keep the theoretical developments, to be discussed in the rest of this
chapter, simple while still covering most cases of interest it will be assumed
henceforth that $\mathfrak{P}$ has open kernel and only statistics of the type $t^{(1)}$ will be
considered. On account of Corollaries 10.2 and 10.3 little or nothing is lost by the
second restriction.


10.2 CUTS IN GENERAL EXPONENTIAL FAMILIES 

Certain conditions for a component $t^{(1)}$ to be a cut will be given here. One of these
suggests a way of constructing exponential families with cuts, and this is discussed
at the end of the section. For families of discrete type it is possible to obtain a
considerable body of further results, to be described in the next section.

Lemma 10.2, Let t^^^ be a cut of size d and let and be a corresponding pair of 
L-independent parameters. Suppose that ip has open kernel 

Then and are in one-to-one correspondence if and only if d — 

Proof The only if assertion is straightforward to derive from the definition of size. 
The converse assertion is a consequence of Theorem 10.1, Corollary 10.2, and 
Lemma 8.3(v). ► 

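Theorem 10.3. Let $\mathfrak{P}$ be interior (i.e. if $\{P_\theta : \theta \in \Theta\}$ is a minimal parametrization
of the exponential family $\bar{\mathfrak{P}}$ generated by $\mathfrak{P}$ and if $\Theta_0$ is the subset of $\Theta$ corresponding
to $\mathfrak{P}$ then $\Theta_0 \subset \operatorname{int}\Theta$), with open kernel, and suppose $t^{(1)}$ is a cut of size
$k^{(1)}$.

Then $\tau^{(1)}$ and $\theta^{(2)}$ is a corresponding pair of L-independent parameters, and $\theta^{(1)}$
and $a$ are of the form

(1)  $\theta^{(1)} = \varphi(\tau^{(1)}) + \chi(\theta^{(2)})$

(2)  $a(\tau^{(1)}, \theta^{(2)}) = a_1(\tau^{(1)})\, a_2(\theta^{(2)})$.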

Furthermore, the quantity 

does not depend on 

Proof. That $\tau^{(1)}$ and $\theta^{(2)}$ is a corresponding pair of L-independent parameters
follows at once from Lemma 10.2.
and using the affine independence of the coordinates of $t^{(1)}$ one obtains (1) and (2).
The last assertion is an expression of the fact that the marginal distribution of
$t^{(1)}$ does not depend on $\theta^{(2)}$. ►

Let $c(\cdot \mid t^{(1)})$ be the Laplace transform of the conditional distribution of $t^{(2)}$ given
$t^{(1)}$ under $P_0$. Thus


Furthermore, let denote the set of possible values of 

(3) for some 

and let t\}^ be a point in From the last conclusion of Theorem 10.3 it follows 
simply that for r}(6^^^) = ;^( 0 ) ~ one has 

(4) = c(d^^Yo^^) 
or, equivalently, 


= Kie^Yo^) + 


where $\kappa(\cdot \mid t^{(1)}) = \ln c(\cdot \mid t^{(1)})$.


Suppose $\Theta^{(2)}$ is an open subset of $R^{k^{(2)}}$. The conditional cumulant transform
$\kappa(\cdot \mid t^{(1)})$ is differentiable in $\Theta^{(2)}$ and its gradient equals the conditional
mean value of $t^{(2)}$ under $P_{(\tau^{(1)}, \theta^{(2)})}$. The vector function $\eta$ must therefore be
differentiable too and

(5) = ziS^Yo') + (f*'* - f'o 

Hence $t^{(2)}$ has linear regression on $t^{(1)}$. This property is thus a necessary condition
for $t^{(1)}$ to be a cut, but it is not very useful as stated. However, it is possible under
further mild assumptions to derive from it a more immediately applicable
condition which relates to the boundary of the convex hull of $S$. For $\mathfrak{P}$ of
discrete type this possibility is indicated in the next section.

Theorem 10.3 implies, in particular, that if $\mathfrak{P}$ is open and $t^{(1)}$ is a cut of size
$k^{(1)}$ then $\tau^{(1)}$ and $\theta^{(2)}$ are variation independent and $\theta^{(1)}$ may be written as
$\varphi(\tau^{(1)}) + \chi(\theta^{(2)})$. The converse assertion holds provided $\mathfrak{P}$ is not only open but
also connected, as will now be verified.

Lemma 10.3. Suppose $\mathfrak{P}$ is open and connected. If

(i) $\tau^{(1)}$ and $\theta^{(2)}$ are variation independent

(ii) $\theta^{(1)}$ is of the form $\theta^{(1)} = \varphi(\tau^{(1)}) + \chi(\theta^{(2)})$

then $\varphi$ is homeomorphic and, for certain functions $a_1$ and $a_2$,

$a(\tau^{(1)}, \theta^{(2)}) = a_1(\tau^{(1)})\, a_2(\theta^{(2)})$.

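Proof. Without loss of generality one may assume that $\chi(0) = 0$. Then
$\theta^{(1)}(\tau^{(1)}, 0) = \varphi(\tau^{(1)})$
and $\varphi$ maps the domain of variation $\mathcal{T}^{(1)}$ of $\tau^{(1)}$ onto $\Theta^{(1)}(0) = \{\theta^{(1)} : (\theta^{(1)}, 0) \in \Theta\}$ in one-to-one manner.

Now, let $\tau_n^{(1)}$ be a sequence of elements in $\mathcal{T}^{(1)}$ such that $\tau_n^{(1)} \to \tau^{(1)} \in \mathcal{T}^{(1)}$. One has

$\tau_n^{(1)} \to \tau^{(1)} \Rightarrow \varphi(\tau_n^{(1)}) \to \varphi(\tau^{(1)})$

and thus $\varphi$ is continuous. Since $\mathcal{T}^{(1)}$ must be open, it may be concluded that $\varphi$ is a
homeomorphism which maps $\mathcal{T}^{(1)}$ onto $\Theta^{(1)}(0)$.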

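Because of (i) and (ii),

$\theta^{(1)}(\tau^{(1)}, \theta^{(2)}) - \chi(\theta^{(2)}) = \varphi(\tau^{(1)}) \in \Theta^{(1)}(0)$

for all $(\tau^{(1)}, \theta^{(2)})$. Consequently

(6)  $\tau^{(1)}(\theta^{(1)}, \theta^{(2)}) = \varphi^{-1}(\theta^{(1)} - \chi(\theta^{(2)}))$.

Let $\kappa = -\ln a$
and consider for fixed $\theta^{(2)}$ the function $f$ defined on $\Theta^{(1)}(0)$ by

$f(\theta^{(1)}) = \kappa(\theta^{(1)}, 0) - \kappa(\theta^{(1)} + \chi(\theta^{(2)}), \theta^{(2)})$.

Using (6) one finds the gradient of $f$ to be 0 because

$Df(\theta^{(1)}) = \tau^{(1)}(\theta^{(1)}, 0) - \tau^{(1)}(\theta^{(1)} + \chi(\theta^{(2)}), \theta^{(2)})$

$= \varphi^{-1}(\theta^{(1)}) - \varphi^{-1}(\theta^{(1)}) = 0$.

Since $\Theta$ is assumed to be open, $\Theta^{(1)}(0)$ is open. Moreover, $\Theta^{(1)}(0)$ is connected,
because $\varphi$ is a homeomorphism which maps $\mathcal{T}^{(1)}$ onto $\Theta^{(1)}(0)$ and because the
connectedness of $\Theta$ and the variation independence of $\tau^{(1)}$ and $\theta^{(2)}$ imply that $\mathcal{T}^{(1)}$
is connected. Let $\tau_0^{(1)} = \tau^{(1)}(0, 0)$. Since $Df = 0$ one has

$f(\theta^{(1)}) = f(0)$ for all $\theta^{(1)}$

i.e.

$\kappa(\theta^{(1)}, 0) - \kappa(\theta^{(1)} + \chi(\theta^{(2)}), \theta^{(2)}) = \kappa(0, 0) - \kappa(\chi(\theta^{(2)}), \theta^{(2)})$

and, considering $\kappa$ as a function of $(\tau^{(1)}, \theta^{(2)})$, one obtains

$\kappa(\tau^{(1)}, \theta^{(2)}) = \kappa(\tau^{(1)}, 0) + \kappa(\tau_0^{(1)}, \theta^{(2)}) - \kappa(\tau_0^{(1)}, 0)$,

where the last term equals $\kappa(0, 0) = 0$. Thus, letting

$\kappa_1(\tau^{(1)}) = \kappa(\tau^{(1)}, 0)$ and $\kappa_2(\theta^{(2)}) = \kappa(\tau_0^{(1)}, \theta^{(2)})$,

one has

$\kappa(\tau^{(1)}, \theta^{(2)}) = \kappa_1(\tau^{(1)}) + \kappa_2(\theta^{(2)})$

from which the conclusion of the lemma follows immediately. ►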

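Theorem 10.4. Suppose $\mathfrak{P}$ is open and connected. Then $t^{(1)}$ is a proper cut of size
$d = k^{(1)}$ if and only if

(i) $\tau^{(1)}$ and $\theta^{(2)}$ are variation independent.

(ii) $\theta^{(1)}$ is of the form

$\theta^{(1)} = \varphi(\tau^{(1)}) + \chi(\theta^{(2)})$.

Proof. The necessity of this assertion follows from Theorem 10.3.

Suppose (i) and (ii) are fulfilled. By Lemma 10.3, $\mathfrak{P}$ has a minimal representation
of the form

$\dfrac{dP_\theta}{dP_0} = a_1(\tau^{(1)})\, a_2(\theta^{(2)})\, e^{(\varphi(\tau^{(1)}) + \chi(\theta^{(2)})) \cdot t^{(1)} + \theta^{(2)} \cdot t^{(2)}}$.
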

The distribution of $t^{(1)}$ has density

and as the next step it will be shown that this density does not depend on $\theta^{(2)}$. This
means proving that

does not depend on $\theta^{(2)}$. Let $\Theta^{(2)}$ denote the domain of variation for $\theta^{(2)}$, let $\theta_0^{(2)}$ be
an arbitrary but fixed element in $\Theta^{(2)}$ and let

With this notation

is the density of $t^{(1)}$ corresponding to the measure in $\mathfrak{P}$ given by For
every x one obtains

that is

J = 1.

Since $t^{(1)}$ is complete,

0 ( 2 ))= 2

and hence 0^^^) does not depend on $\theta^{(2)}$.

One may now conclude that $t^{(1)}$ is a cut with $\tau^{(1)}$ and $\theta^{(2)}$ as a corresponding
pair of L-independent parameters, and on account of Lemma 10.2 one has
$k^{(1)} = d$. ►

Regular exponential families are open and connected and, since for any mixed
parameter $(\tau^{(1)}, \theta^{(2)})$ corresponding to a regular exponential family the components
$\tau^{(1)}$ and $\theta^{(2)}$ are variation independent (cf. Theorem 8.4), one has the
following:

Corollary 10.4. Suppose $\mathfrak{P}$ is regular. Then $t^{(1)}$ is a cut of size $d = k^{(1)}$ if and only if
$\theta^{(1)}$ is of the form

$\theta^{(1)} = \varphi(\tau^{(1)}) + \chi(\theta^{(2)})$.

Applications of Corollary 10.4 have been given in Section 9.2. As a further,
immediate, consequence one has, provided $\mathfrak{P}$ is regular, that
$t^{(1)}(x_1) + \cdots + t^{(1)}(x_n)$ is a cut with respect to the family of joint distributions of
$x_1, \dots, x_n$ for every $n = 1, 2, \dots$ if and only if this is true
for one $n$.

Next, a method for construction of exponential families with cuts, suggested by 
formula (4), will be mentioned. 

The discussion will, for the sake of simple argumentation, be confined to the
case where $\mathfrak{P}$ is regular and $k = 2$, and where the set of possible values of $t^{(1)}$ either
is of the form $\{0, 1, 2, \dots, n\}$ for some positive integer $n$ or is equal to $N_0$. But it will be
obvious that the method indicated is applicable more generally.

Suppose, for the moment, that $t^{(1)}$ is a cut and that formula (4) holds. Taking $t_0^{(1)}$
to be 0 one sees that $c(\cdot \mid t^{(1)})$ is of the form

(7)  $c(\theta^{(2)} \mid t^{(1)}) = c_0(\theta^{(2)})\, \zeta(\theta^{(2)})^{t^{(1)}}$

where $c_0(\cdot)$ is a Laplace transform. Presumably, in most cases of interest $\zeta(\cdot)$ will
also be a Laplace transform. In that event, (7) shows that, under $P_0$ and for given
$t^{(1)}$, the variate $t^{(2)}$ has the form

(8)  $z_0 + z_1 + \cdots + z_{t^{(1)}}$,

$z_0, z_1, z_2, \dots$ being independent random variables and $z_1, z_2, \dots$ having identical
distributions.

This leads to the idea of constructing a family $\mathfrak{P}$ in the following way. Take an
arbitrary regular exponential family, of order one and with support as described above, to be
the family of marginal distributions of $t^{(1)}$. Let

(9)  $a_1(\alpha)\, h_1(t^{(1)})\, e^{\alpha t^{(1)}}, \quad \alpha \in A$

be the densities of this family, assume that $0 \in A$ and denote by $c_1$ the Laplace
transform of the distribution of $t^{(1)}$ for $\alpha = 0$. Furthermore, take any two Laplace
transforms (of one-dimensional distributions) $c_0$ and $\zeta$, let $B$ be the set
$\{\beta : c_0(\beta) < \infty \text{ and } \zeta(\beta) < \infty\}$ and suppose $B$ is open. Define $\mathfrak{P}$ as the family

$\{P_{(\alpha, \beta)} : \alpha \in A,\ \beta \in B\}$

of distributions of $t$ determined by the specification that under $P_{(\alpha, \beta)}$ the
distribution of $t^{(1)}$ is given by (9) while the conditional distribution of $t^{(2)}$ is
determined by (8) together with the requirement that the Laplace transform of $z_0$
is $c_0(\cdot + \beta)/c_0(\beta)$ and that the common distribution of $z_1, z_2, \dots$ has Laplace
transform $\zeta(\cdot + \beta)/\zeta(\beta)$.

Clearly, $t^{(1)}$ is a cut relative to $\mathfrak{P}$. Moreover, as desired, $\mathfrak{P}$ is exponential of
order 2 and regular, with $t$ as minimal canonical statistic.

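To see this, let $\bar{\mathfrak{P}}$ be the full exponential family in $R^2$ generated by $P_0$ (the
element $P_{(0,0)}$ of $\mathfrak{P}$) and $t$, i.e. $\bar{\mathfrak{P}}$ is the set of probability measures which are
equivalent to $P_0$ and whose densities with respect to this measure are of the
form $a(\theta)\exp(\theta \cdot t)$. Now, by the construction of $P_{(\alpha, \beta)}$ its Laplace transform
$c_{(\alpha, \beta)}(\cdot)$ may be calculated thus

(10)  $c_{(\alpha, \beta)}(\theta) = E_{(\alpha, \beta)}\, e^{\theta \cdot t}$

      $= E_{(\alpha, \beta)} \left[ e^{\theta^{(1)} t^{(1)}}\, \dfrac{c_0(\theta^{(2)} + \beta)}{c_0(\beta)} \left\{ \dfrac{\zeta(\theta^{(2)} + \beta)}{\zeta(\beta)} \right\}^{t^{(1)}} \right]$

      $= \dfrac{c_0(\theta^{(2)} + \beta)}{c_0(\beta)} \cdot \dfrac{c_1[\theta^{(1)} + \ln\{\zeta(\theta^{(2)} + \beta)/\zeta(\beta)\} + \alpha]}{c_1(\alpha)}$.

Hence, in particular, the Laplace transform $c$ of $P_0$ is given by

(11)  $c(\theta) = c_0(\theta^{(2)})\, c_1\{\theta^{(1)} + \ln\zeta(\theta^{(2)})\}$.

On comparing (10) and (11) one sees that $P_{(\alpha, \beta)}$ is equal to the element of $\bar{\mathfrak{P}}$
determined by the value $(\alpha - \ln\zeta(\beta), \beta)$ of the canonical parameter $\theta$. Thus $\mathfrak{P} \subset \bar{\mathfrak{P}}$
and it is now a simple matter to show that, in fact, $\bar{\mathfrak{P}}$ is regular and $\mathfrak{P} = \bar{\mathfrak{P}}$.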

Example 10.5. Suppose $t^{(1)}$ follows a negative binomial distribution with point
probabilities

$\dbinom{\chi + t^{(1)} - 1}{t^{(1)}}\, \pi^{t^{(1)}} (1 - \pi)^{\chi}$,

$\chi$ being fixed and $\pi$ varying in $(0, 1)$. Let $z_0, z_1, z_2, \dots$ be independent and such that
$z_0$ has a gamma distribution with density

$\dfrac{\delta^{\chi}}{\Gamma(\chi)}\, z^{\chi - 1} e^{-\delta z}$

while $z_1, z_2, \dots$ follow the exponential distribution

$\delta e^{-\delta z}$,

the parameter $\delta$ having $(0, \infty)$ as domain of variation. Set

$t^{(2)} = z_0 + z_1 + \cdots + z_{t^{(1)}}$.

Then $t^{(1)}$ is a cut in the exponential family of joint distributions of $t^{(1)}$ and $t^{(2)}$
with $\pi$ and $\delta$ as a corresponding pair of L-independent parameters.

It may, incidentally, be noted that this example has the property that $t^{(2)}$ is also
a cut, $(1 - \pi)\delta$ and $\pi\delta$ being L-independent. In fact, the marginal distribution of
$t^{(2)}$ is the gamma distribution

$\dfrac{\{(1 - \pi)\delta\}^{\chi}}{\Gamma(\chi)}\, (t^{(2)})^{\chi - 1} e^{-(1 - \pi)\delta t^{(2)}}$

and the conditional distribution of $t^{(1)}$ given $t^{(2)}$ is Poisson with mean value $\pi\delta t^{(2)}$.

►
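
The two closing claims of Example 10.5 can be checked numerically. The sketch below is an illustration only; it assumes the reading of the example in which $t^{(1)}$ has point probabilities proportional to $\pi^{t^{(1)}}(1-\pi)^{\chi}$ and $z_0$ has gamma shape $\chi$, and all function names are hypothetical. It sums the joint density over the values of $t^{(1)}$ and compares the result with a gamma density of rate $(1-\pi)\delta$, and it compares the conditional probabilities of $t^{(1)}$ with Poisson probabilities of mean $\pi\delta t^{(2)}$.

```python
from math import exp, lgamma, log, isclose

def log_negbin_pmf(n, chi, pi):
    """log of C(chi + n - 1, n) * pi**n * (1 - pi)**chi, for real chi > 0."""
    return (lgamma(chi + n) - lgamma(n + 1) - lgamma(chi)
            + n * log(pi) + chi * log(1 - pi))

def log_gamma_pdf(y, shape, rate):
    return shape * log(rate) - lgamma(shape) + (shape - 1) * log(y) - rate * y

def joint_density(n, y, chi, pi, delta):
    """P(t1 = n) times the density of t2 = z0 + z1 + ... + zn, a Gamma(chi + n, delta)."""
    return exp(log_negbin_pmf(n, chi, pi) + log_gamma_pdf(y, chi + n, delta))

def marginal_density_t2(y, chi, pi, delta, terms=200):
    # Truncated sum over t1; the omitted tail is negligible for moderate pi*delta*y.
    return sum(joint_density(n, y, chi, pi, delta) for n in range(terms))

def conditional_pmf_t1(n, y, chi, pi, delta):
    return joint_density(n, y, chi, pi, delta) / marginal_density_t2(y, chi, pi, delta)

def poisson_pmf(n, mean):
    return exp(n * log(mean) - mean - lgamma(n + 1))

chi, pi, delta, y = 2.5, 0.3, 1.7, 2.0
assert isclose(marginal_density_t2(y, chi, pi, delta),
               exp(log_gamma_pdf(y, chi, (1 - pi) * delta)))
assert isclose(conditional_pmf_t1(3, y, chi, pi, delta), poisson_pmf(3, pi * delta * y))
```

The check uses only one parameter point and one value of $t^{(2)}$; it is a sanity check of the stated formulas, not a proof of the cut property.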


10.3 CUTS IN DISCRETE-TYPE EXPONENTIAL FAMILIES 

It is assumed throughout this section not only that the elements of $\mathfrak{P}$ are discrete-type
distributions but also that $\mathfrak{P}$ is full and linear. Thus $x$ is minimal canonical.

Instances of cuts in such families were, in effect, given in Examples 3.2-3.6. 
These examples concerned the multinomial, multivariate Poisson, and negative 
multinomial families, all of which belong to the class of sum-symmetric power 
series families; this class is investigated separately in the last part of the present 
section. 

The first part is devoted to the derivation of two necessary conditions for $t^{(1)}$ to
be a cut. These drastically limit the class of possible cuts. In particular, if $\mathfrak{X}$ is a
subset of $N_0^k$ containing all the points $(\varepsilon_1, \dots, \varepsilon_k)$ such that $\varepsilon_i$ is either 0 or 1
($i = 1, \dots, k$), then it appears, from the two conditions mentioned, that the only
possible cuts are of the form

(1)  $t^{(1)} = \left( \sum_{i \in I_1} x_i, \dots, \sum_{i \in I_m} x_i \right)$,

$I_1, \dots, I_m$ being disjoint subsets of $\{1, 2, \dots, k\}$. Basic special cases of (1) are
$t^{(1)} = x_\cdot\ (= x_1 + \cdots + x_k)$, and $t^{(1)} = x^{(1)}$ (for some partition $(x^{(1)}, x^{(2)})$ of $x$).

Theorem 10.5. Suppose $\mathfrak{P}$ is regular. Let $S'$ be the set of points in $S$ having first
component equal to a fixed value $t^{(1)}$.

If there exists a point in the interior of the convex hull of $S$ which has $t^{(1)}$ as first
component and which does not belong to the convex hull of $S'$ then the statistic
$t^{(1)}$ is not a cut.

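Proof. The family $\mathfrak{P}$ has an exponential representation of the form

$\dfrac{dP_\theta}{d\mu} = a(\theta)\, e^{\theta \cdot t}$.

For any $t \in R^k$, let the function $l(\theta) = \theta \cdot t + \ln a(\theta)$ be termed the log-likelihood
function corresponding to $t$, even if $t$ is not in $S$ (just as in Chapter 9).

Now, let $\tilde t = (t^{(1)}, \tilde t^{(2)})$ be the point whose existence is hypothesized in the
theorem. Since $\tilde t$ belongs to the interior of the convex hull of $S$ the log-likelihood
function corresponding to $\tilde t$ has a maximum (cf. Theorem 9.13), i.e. there exists a
value $(\hat\tau^{(1)}, \hat\theta^{(2)})$ of the mixed parameter such that, considering $l$ as a function
of $(\tau^{(1)}, \theta^{(2)})$, one has

(2)  $l(\hat\tau^{(1)}, \hat\theta^{(2)}) \ge l(\tau^{(1)}, \theta^{(2)})$

for every $(\tau^{(1)}, \theta^{(2)})$.

Suppose $t^{(1)}$ is a cut. Then $\tau^{(1)}$ and $\theta^{(2)}$ are variation independent and

(3)  $l(\tau^{(1)}, \theta^{(2)}) = l_1(\tau^{(1)}) + l_2(\theta^{(2)})$

where $l_1$ is the log-likelihood function for the marginal distribution of $t^{(1)}$ while $l_2$
is the log-likelihood function for the conditional distribution of $t^{(2)}$ given $t^{(1)}$.
Formulas (2) and (3) imply that $l_2$ has a maximum as $\theta^{(2)}$ varies over $\Theta^{(2)}$.
Moreover, $\Theta^{(2)}$ is open.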

Now, for any exponential family whose canonical parameter domain is open it
is (whether the exponential representation considered is minimal or not) a
necessary condition for the log-likelihood function to have a maximum that the
value of the canonical statistic belongs to the relative interior of the convex
support of the family. This is a simple consequence of the first assertion in
Theorem 9.13, and the well-known result that a local maximum of a concave
function must, in fact, be global.

It follows that $\tilde t$ belongs to (the relative interior of) the convex hull of
the points of $S$ with first component $t^{(1)}$, which is contrary to assumption. ►

The use of Theorem 10.5 may be illustrated through the following remarks
which pertain to the case $k = 2$.

Suppose all the points of $\mathfrak{X}$ are support points of the distributions of $x$ (i.e. $\mathfrak{X}$
contains no superfluous points). Every minimal canonical statistic $t$ is a
(nonsingular) affine transformation of $x$ and hence any $t^{(1)}$ corresponds to a
system of parallel lines in the range space, $R^2$, of $x$, namely the lines on which
$t^{(1)}$, viewed as an affine function on $R^2$, is constant. For $k = 2$ the result of Theorem
10.5 may therefore be paraphrased as follows. Consider an arbitrary system of
parallel lines such that each line contains at least one point of $S$. In order for the
$t^{(1)}$ determined by this system to be a cut it is necessary that for every one of the
lines, the smallest line segment containing all points of $S$ on the line be equal to the
intersection between the line and the closure of the convex hull of $S$.

For instance, if $\mathfrak{X}$ is a subset of $N_0^2$ and if the points (0, 0), (0, 1), (1, 0), and (1, 1) all
lie in $\mathfrak{X}$ then it is easily seen from a diagram such as Figure 10.1 that, besides
$x_1$, $x_2$, and $x_\cdot$, which are well known to be cuts in certain cases, the only
possibility for a cut is $x_2 - x_1$. The latter may, however, be readily excluded by
the next necessary condition.
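
The paraphrase of Theorem 10.5 for $k = 2$ lends itself to a mechanical check on a finite support. The following is a minimal sketch under stated assumptions: the support is a finite list of integer points, the statistic is $t^{(1)} = a \cdot x$ for an integer vector $a$, and the helper names are hypothetical. For each level line it compares the extent of the support points on the line with the extent of the closed convex hull along that line, the latter being found from the crossings of the line with the segments joining pairs of support points.

```python
from fractions import Fraction
from itertools import combinations

def _dot(a, x):
    return a[0] * x[0] + a[1] * x[1]

def support_interval(points, a, c):
    """Extent, along the line a.x = c, of the support points lying on that line."""
    b = (-a[1], a[0])  # direction orthogonal to a, used as a coordinate along the line
    vals = [Fraction(_dot(b, p)) for p in points if _dot(a, p) == c]
    return min(vals), max(vals)

def hull_interval(points, a, c):
    """Extent, along the line a.x = c, of the closed convex hull of the points."""
    b = (-a[1], a[0])
    vals = [Fraction(_dot(b, p)) for p in points if _dot(a, p) == c]
    for p, q in combinations(points, 2):
        ap, aq = _dot(a, p), _dot(a, q)
        if (ap - c) * (aq - c) < 0:  # the segment pq crosses the line strictly
            lam = Fraction(c - ap, aq - ap)
            x = (p[0] + lam * (q[0] - p[0]), p[1] + lam * (q[1] - p[1]))
            vals.append(_dot(b, x))
    return min(vals), max(vals)

def passes_hull_condition(points, a):
    """Necessary condition for t1 = a.x to be a cut (k = 2 paraphrase of Theorem 10.5)."""
    levels = {_dot(a, p) for p in points}
    return all(hull_interval(points, a, c) == support_interval(points, a, c)
               for c in levels)

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
candidates = {a: passes_hull_condition(square, a)
              for a in [(1, 0), (0, 1), (1, 1), (-1, 1), (1, 2)]}
```

On the four corner points the directions for $x_1$, $x_2$, $x_\cdot$ and $x_2 - x_1$ pass, while a direction such as $x_1 + 2x_2$ does not, in line with the remark in the text. Passing this check is only necessary, not sufficient, for being a cut.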



For simplicity, this second condition will be derived and stated for $k = 2$ only,
but it will be clear that the derivation and result are extendable to general $k$.
Again, let $\mathfrak{P}$ be regular and consider relation (5) of Section 10.2. Since $k = 2$, the
range $\Theta^{(2)}$ of $\theta^{(2)}$ is an open interval of $R$. Assume that $\inf \Theta^{(2)} = -\infty$ and that for
each $t^{(1)}$ in the set given by (3) of Section 10.2, the set of points $t^{(2)}$ for which
$(t^{(1)}, t^{(2)}) \in S$ is bounded below. These conditions are fulfilled, in particular, if $\mathfrak{X}$ is
finite. Then, letting $\theta^{(2)} \to -\infty$ in (5) of Section 10.2, one finds that for $t^{(1)}$ to be a
cut the points $(t^{(1)}, \min t^{(2)})$ must lie on a straight line, the minimum being
taken over the values $t^{(2)}$ with $(t^{(1)}, t^{(2)}) \in S$.

Example 10.6. The suppositions made to obtain the conclusion are satisfied in the
situation illustrated by Figure 10.1 and for $t^{(1)} = x_2 - x_1$. Therefore $x_2 - x_1$ is
not a cut. ►
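
The second necessary condition can be checked in the same spirit. The sketch below is illustrative only: it assumes a finite integer support, takes $t^{(1)} = a \cdot x$ and $t^{(2)} = b \cdot x$ with $(a, b)$ nonsingular (the choice of $b$ is an assumption of the example, not something fixed by the text), and the function names are hypothetical. It lists the points $(t^{(1)}, \min t^{(2)})$ and tests whether they are collinear.

```python
def lowest_points(points, a, b):
    """For t1 = a.x and t2 = b.x, pair each attained t1 with the smallest t2 on the support."""
    lowest = {}
    for x in points:
        t1 = a[0] * x[0] + a[1] * x[1]
        t2 = b[0] * x[0] + b[1] * x[1]
        lowest[t1] = min(t2, lowest.get(t1, t2))
    return sorted(lowest.items())

def collinear(pts):
    """Exact integer test that all points lie on the line through the first two."""
    if len(pts) < 3:
        return True
    (u0, v0), (u1, v1) = pts[0], pts[1]
    return all((u1 - u0) * (v - v0) == (v1 - v0) * (u - u0) for u, v in pts[2:])

def passes_second_condition(points, a, b):
    return collinear(lowest_points(points, a, b))

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
triangle = [(i, j) for i in range(3) for j in range(3) if i + j <= 2]

# t1 = x2 - x1 on the four corner points, with t2 = x1 (as in Example 10.6).
difference_on_square = passes_second_condition(square, (-1, 1), (1, 0))
# t1 = x1 + x2 on a trinomial-type support with n = 2, with t2 = x1.
sum_on_triangle = passes_second_condition(triangle, (1, 1), (1, 0))
```

Here the first check fails and the second passes, which matches the conclusion of Example 10.6 and the known cut $x_\cdot$ in the multinomial case. As before, the condition is only necessary.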

Example 10.7. Logistic dose-(binomial) response model. Figure 10.2 shows the set $\mathfrak{X}$
and its convex hull for the case where $x$ equals the minimal canonical statistic
$(s, w)$ of the logistic, quantal response model with $n = 1$, $d = 5$ (cf. Example 9.14),
the doses being placed at $-2, -1, 0, 1, 2$. The second necessary condition above
and a glance at Figure 10.2 immediately reveal that in this instance there are no
cuts. ►

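In the remainder of this section $\mathfrak{X}$ is assumed to be a subset of $N_0^k$. Thus $\mathfrak{P}$ is a
power series family and the point probabilities of $\mathfrak{P}$ may be written

(4)  $p(x)\lambda^x / g(\lambda)$

where $\lambda = (\lambda_1, \dots, \lambda_k) = (e^{\theta_1}, \dots, e^{\theta_k})$, $\lambda^x = \lambda_1^{x_1} \cdots \lambda_k^{x_k}$ and $g$ is the generating
function for the probabilities $p(x)$ of $P_0$.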

Recall, from Example 8.4, that $\mathfrak{P}$ is a sum-symmetric power series family if and
only if $g$ depends on $\lambda$ through $\lambda_\cdot = \lambda_1 + \cdots + \lambda_k$ only. In this case $g_0$, defined by
$g_0(\bar\lambda) = g(\lambda)$, where $\bar\lambda = \lambda_\cdot / k$, is the generating function for the point probabilities
$p_0(x_\cdot)$ of $x_\cdot$ under $P_0$ and consequently

(5)  $p(x) = p_0(x_\cdot)\, \dfrac{x_\cdot!}{x_1! \cdots x_k!}\, k^{-x_\cdot}$.

Hence the marginal distribution of $x_\cdot$ is given by

(6)  $p_0(x_\cdot)\, \bar\lambda^{x_\cdot} / g_0(\bar\lambda)$.


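A probability distribution is called a singular multinomial distribution of order $k$
and with numbering parameter $n$ if it has support

$\{(x_1, \dots, x_k) : x_i \in N_0\ (i = 1, \dots, k),\ x_1 + \cdots + x_k = n\}$

and point probabilities

$\dfrac{n!}{x_1! \cdots x_k!}\, \pi_1^{x_1} \cdots \pi_k^{x_k}$

for some $\pi = (\pi_1, \dots, \pi_k) \in \Pi_0$ where

(7)  $\Pi_0 = \{(\pi_1, \dots, \pi_k) : \pi_i > 0\ (i = 1, \dots, k),\ \pi_1 + \cdots + \pi_k = 1\}$.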

From (4) and (5) one finds:

Theorem 10.6. Suppose $\mathfrak{P}$ is a sum-symmetric power series family, let $(x^{(1)}, \dots, x^{(m)})$
be a partition of $x$ and let $k^{(i)}$ and $x_\cdot^{(i)}$ denote, respectively, the dimension and the sum
of the coordinates of $x^{(i)}$ ($i = 1, \dots, m$).

Then $x_\cdot^{(*)} = (x_\cdot^{(1)}, \dots, x_\cdot^{(m)})$ is a cut. Moreover, the family of distributions of $x_\cdot^{(*)}$ is a
sum-symmetric power series family while the family of conditional distributions of $x$
given $x_\cdot^{(*)}$ is the product of the $m$ singular multinomial families of orders $k^{(1)}, \dots, k^{(m)}$
and with trial parameters $x_\cdot^{(1)}, \dots, x_\cdot^{(m)}$. ►
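
For the multivariate Poisson family, one of the sum-symmetric families named in this section, the factorization behind Theorem 10.6 with $m = 1$ can be verified directly. The sketch below is illustrative, with hypothetical function names: it checks that the joint point probability of independent Poisson counts equals the Poisson probability of the total times a singular multinomial probability with trial parameter $x_\cdot$ and cell probabilities $\lambda_i/\lambda_\cdot$.

```python
from math import exp, factorial, isclose, prod

def poisson_pmf(n, mean):
    return exp(-mean) * mean**n / factorial(n)

def joint_poisson_pmf(x, lam):
    """Point probability of independent Poisson counts x with means lam."""
    return prod(poisson_pmf(xi, li) for xi, li in zip(x, lam))

def singular_multinomial_pmf(x, probs):
    """Multinomial point probability with trial parameter sum(x)."""
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)  # each partial quotient is an integer
    return coef * prod(p**xi for p, xi in zip(probs, x))

x = (2, 0, 3)
lam = (0.5, 1.2, 2.0)
total = sum(lam)
lhs = joint_poisson_pmf(x, lam)
rhs = poisson_pmf(sum(x), total) * singular_multinomial_pmf(x, [l / total for l in lam])
assert isclose(lhs, rhs)
```

The marginal factor depends on $\lambda$ only through $\lambda_\cdot$ and the conditional factor only through the proportions $\lambda_i/\lambda_\cdot$, which is the separation of parameters one associates with $x_\cdot$ being a cut.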


Conversely, one has: 

Theorem 10.7. If $x_\cdot$ is a cut and the family of conditional distributions of $x$ given $x_\cdot$ is,
for every value of $x_\cdot$, the singular multinomial family of order $k$ and with trial
parameter $x_\cdot$ then $\mathfrak{P}$ is sum-symmetric.

Proof. Since $x_\cdot$ is a cut, the class of marginal distributions of $x_\cdot$ is a power series
family having point probabilities

(8)  $p_0(x_\cdot)\, \lambda_0^{x_\cdot} / g_0(\lambda_0)$

for some $p_0$ and $g_0$. Thus the probability of $x$ is

$p_0(x_\cdot)\, \dfrac{x_\cdot!}{x_1! \cdots x_k!}\, \prod_{i=1}^{k} (\lambda_0 \pi_i)^{x_i} / g_0(\lambda_0)$

which shows that $\mathfrak{P}$ is sum-symmetric. ►

Theorem 10.8. Suppose $\mathfrak{X}$ contains the origin and the unit vectors $(1, 0, \dots, 0), \dots, (0, \dots, 0, 1)$.

Then $\mathfrak{P}$ is a sum-symmetric power series family if and only if $x_\cdot$ is a cut.

Proof. The only if assertion is a consequence of Theorem 10.6.

Suppose $x_\cdot$ is a cut. By (4) and (8) the probability that $x_\cdot$ equals 0, respectively
1, may be written

(9)  $p_0(0)/g_0(\lambda_0) = p(0)/g(\lambda)$,

respectively

(10)  $p_0(1)\lambda_0/g_0(\lambda_0) = \{\sum_i p(e_i)\lambda_i\}/g(\lambda)$

where $e_i$ denotes the $i$th unit vector. Combining (9) and (10) one obtains

(11)  $\lambda_0 = c_1\lambda_1 + \cdots + c_k\lambda_k$

for certain positive constants $c_1, \dots, c_k$. Set

$\tilde\lambda = (c_1\lambda_1, \dots, c_k\lambda_k)$
$b(x) = p(x)\, c_1^{-x_1} \cdots c_k^{-x_k}$
$\tilde g(\tilde\lambda) = g(\lambda)$;

then the probability (4) of $x$ can be expressed thus:

$b(x)\tilde\lambda^x / \tilde g(\tilde\lambda)$.

According to (9) and (11), the function $\tilde g$ depends on $\tilde\lambda$ only through $\tilde\lambda_\cdot$. ►

Note that if $\mathfrak{P}$ is a sum-symmetric power series family and if $(x^{(1)}, x^{(2)})$ is a
partition of $x$ then the family of conditional distributions of $x^{(2)}$ given $x^{(1)}$ is also a
sum-symmetric power series family and it depends on $x^{(1)}$ through $x_\cdot^{(1)}$, the sum of
the coordinates of $x^{(1)}$, only.

Theorem 10.9. Let $(x^{(1)}, x^{(2)})$ be a partition of $x$ and suppose $\mathfrak{P}$ is a sum-symmetric
power series family such that the probability that $x_\cdot$ equals 1 is less than one. (In the
case where this probability is one, $\mathfrak{P}$ is the singular multinomial family with
trial parameter 1.)

Then $x^{(1)}$ is a cut if and only if $\mathfrak{P}$ is either the multinomial, the multivariate
Poisson or the negative multinomial family.

Proof. It is straightforward to check the if assertion.

Next, note that if the event that $x_\cdot = 0$ or 1 has probability 1 then $\mathfrak{P}$ is the
multinomial family with trial parameter 1. Henceforth it is therefore assumed
that the probability of this event is less than 1.

Let $x^{(1)}$ be a cut and suppose first that the dimension of $x^{(1)}$ is one. For
conciseness, let $p_n$ ($n = 0, 1, 2, \dots$) denote the probability that $x^{(1)} = n$. The above
assumption implies $p_n > 0$ for $n = 0, 1$, and 2, and since $x^{(1)}$ is a cut one has

$p_0 p_2 / p_1^2 = c$

where $c$ is a constant (independent of the parameters). Generating function
technique shows that

$p_0 p_2 / p_1^2 = \tfrac{1}{2}\, g_0(\lambda_0)\, g_0''(\lambda_0) / \{g_0'(\lambda_0)\}^2$

and hence, for certain constants $c'$ and $c''$,

$g_0'(\lambda_0)/g_0(\lambda_0) = 1/(c' + c''\lambda_0)$.

It follows that $g_0$, which is the generating function for $x_\cdot$ under $P_0$, corresponds to
either a binomial ($c'' > 0$), Poisson ($c'' = 0$), or negative binomial ($c'' < 0$)
distribution. This implies the desired result.

The general case, where the dimension of $x^{(1)}$ is not supposed to be one, may be
thrown back on the one-dimensional case by remarking that the class of
distributions of $(x_\cdot^{(1)}, x^{(2)})$, where $x_\cdot^{(1)}$ is the sum of the coordinates of $x^{(1)}$, is also a
sum-symmetric power series family. ►
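
The key quantity in the proof, the ratio $p_0 p_2 / p_1^2$, can be evaluated for the three one-dimensional families named in the conclusion. The sketch below is an illustration with hypothetical function names; it confirms numerically that the ratio does not change with the varying parameter and equals $(n-1)/(2n)$, $1/2$ and $(\chi+1)/(2\chi)$ for the binomial with $n$ trials, the Poisson, and the negative binomial with index $\chi$, respectively (values obtained here by direct calculation, not quoted from the text).

```python
from math import comb, exp, lgamma, log, isclose

def binomial_pmf(n, trials, p):
    return comb(trials, n) * p**n * (1 - p)**(trials - n)

def poisson_pmf(n, mean):
    return exp(n * log(mean) - mean - lgamma(n + 1))

def negative_binomial_pmf(n, chi, pi):
    log_coef = lgamma(chi + n) - lgamma(n + 1) - lgamma(chi)
    return exp(log_coef + n * log(pi) + chi * log(1 - pi))

def ratio(pmf, *params):
    """p0 * p2 / p1**2 for a distribution on the non-negative integers."""
    p0, p1, p2 = (pmf(n, *params) for n in range(3))
    return p0 * p2 / p1**2

# The ratio is the same at different values of the varying parameter.
assert isclose(ratio(binomial_pmf, 5, 0.2), ratio(binomial_pmf, 5, 0.7))
assert isclose(ratio(poisson_pmf, 0.3), ratio(poisson_pmf, 4.0))
assert isclose(ratio(negative_binomial_pmf, 2.5, 0.1), ratio(negative_binomial_pmf, 2.5, 0.8))
```

This only illustrates the necessity argument in one direction: the three families have a parameter-free ratio. That no other family does is what the differential equation for $g_0$ in the proof establishes.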


10.4 S-ANCILLARITY AND EXPONENTIAL FAMILIES 

Many examples of S-ancillarity in exponential families have been indicated in the
foregoing, so the discussion here will be confined to some remarks on maximal S-ancillary
statistics and to an example, concerning the correlation coefficient
of the two-dimensional normal distribution, in which Corollary 10.3 and Theorem
10.2 are used to prove the non-existence of S-ancillary statistics.

Corollary 10.3 and Theorem 10.2 show, in particular, that if $\mathfrak{P}$ is open and
convex and if there exists a cut, S-ancillary with respect to a parameter of interest
$\psi$, whose size is the largest possible compatible with the rank of $\psi$ then this cut is
maximal S-ancillary. However, a cut of largest possible size need not exist and if it
does not there may be no unique maximal S-ancillary statistic.

Example 10.8. For the two-by-two contingency table with the total fixed at $n$, a
minimal canonical statistic is $(x_{11}, x_{\cdot 1}, x_{1\cdot})$. Suppose $\psi$ is the interaction parameter,

$\psi = \ln \dfrac{p_{11} p_{22}}{p_{12} p_{21}}$.

Then x 1 and Xj. are both cuts, of size 1 and S-ancillary with respect to ij/. There 
exist no cuts of size 2 which are S-ancillary with respect to ij/, because in the 
contrary case (x.i,Xi.) would, by Theorem 10.2, be a cut, which it is not (as is 
obvious from the fact that x 1 and Xj are independent if and only if i/^ = 0). 
Moreover, x ^ and x^^ are relatively maximal, as will be verified below, and 
hence there exists no unique maximal cut, S-ancillary with respect to xj/. 
Suppose that $\tilde T^{(1)}$ is a cut, S-ancillary with respect to $\psi$, and that, in the
$\sigma$-algebra notation introduced in the beginning of Section 4.2,

(1)    $\sigma(X_{\cdot 1}) \subset \sigma(\tilde T^{(1)})$.

By Lemma 10.1 there exist constants a, b, and c such that

$\tilde T^{(1)} = aX_{\cdot 1} + bX_{1\cdot} + c.$

Obviously one may assume that a = 1 and c = 0, so

$\tilde T^{(1)} = X_{\cdot 1} + bX_{1\cdot}.$

In order to prove that $X_{\cdot 1}$ is relatively maximal it suffices to prove that b = 0.
Suppose first that the points

$c_1 + bc_2,$

where $c_1$ and $c_2$ belong to {0, 1, ..., n}, are all different. Then

$\tilde T^{(1)}(x) = c_1 + bc_2 \iff (X_{\cdot 1}(x), X_{1\cdot}(x)) = (c_1, c_2),$

which implies that the correspondence between $\tilde T^{(1)}$ and $(X_{\cdot 1}, X_{1\cdot})$ is one-to-one.
Consequently $\tilde T^{(1)}$ is not a cut.

Now, assume there exist x and x' such that
$(X_{\cdot 1}(x), X_{1\cdot}(x)) \neq (X_{\cdot 1}(x'), X_{1\cdot}(x'))$ and

$X_{\cdot 1}(x) + bX_{1\cdot}(x) = X_{\cdot 1}(x') + bX_{1\cdot}(x'),$

which means that

$\tilde T^{(1)}(x) = \tilde T^{(1)}(x').$

Hence, on account of (1),

$X_{\cdot 1}(x) = X_{\cdot 1}(x')$

and thus

$b(X_{1\cdot}(x) - X_{1\cdot}(x')) = 0.$

It follows that b = 0.

By symmetry $X_{1\cdot}$ must also be a relatively maximal cut, S-ancillary with
respect to $\psi$. ►
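
The independence criterion used in Example 10.8 can be checked numerically. The following is a minimal sketch and not part of the original text: it assumes multinomial sampling with fixed total n, and the function names are hypothetical. It enumerates all two-by-two tables, accumulates the joint point probabilities of the row total and the column total, and measures how far that joint distribution is from the product of its marginals.

```python
from itertools import product
from math import factorial, log


def interaction(p11, p12, p21, p22):
    """Interaction parameter psi = ln{p11 p22 / (p12 p21)}."""
    return log(p11 * p22 / (p12 * p21))


def margin_joint_pmf(n, p11, p12, p21, p22):
    """Joint point probabilities of (row total, column total) of a 2x2 table.

    Assumes the four cell counts are multinomial with total n.
    """
    pmf = {}
    for x11, x12, x21 in product(range(n + 1), repeat=3):
        x22 = n - x11 - x12 - x21
        if x22 < 0:
            continue
        coef = factorial(n) // (
            factorial(x11) * factorial(x12) * factorial(x21) * factorial(x22)
        )
        prob = coef * p11**x11 * p12**x12 * p21**x21 * p22**x22
        key = (x11 + x12, x11 + x21)  # (first row total, first column total)
        pmf[key] = pmf.get(key, 0.0) + prob
    return pmf


def max_dependence(n, pmf):
    """Largest absolute gap between the joint pmf and the product of its marginals."""
    row = [sum(pmf.get((r, c), 0.0) for c in range(n + 1)) for r in range(n + 1)]
    col = [sum(pmf.get((r, c), 0.0) for r in range(n + 1)) for c in range(n + 1)]
    return max(
        abs(pmf.get((r, c), 0.0) - row[r] * col[c])
        for r in range(n + 1)
        for c in range(n + 1)
    )


n = 3
independent = (0.12, 0.28, 0.18, 0.42)  # outer product of (0.4, 0.6) and (0.3, 0.7)
dependent = (0.30, 0.20, 0.10, 0.40)    # interaction parameter ln 6
gap_independent = max_dependence(n, margin_joint_pmf(n, *independent))
gap_dependent = max_dependence(n, margin_joint_pmf(n, *dependent))
```

With the illustrative cell probabilities above, the gap vanishes (up to rounding) when the interaction parameter is zero and is clearly positive otherwise, in line with the remark that the pair of margins is not a cut.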

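Example 10.9. Let $(x_{i1}, x_{i2})$, $i = 1, 2, \ldots, n$, be n independent and identically
distributed two-dimensional normal variates with mean value $(\xi_1, \xi_2)$ and
variance matrix

$$\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$

The family of joint distributions of these variates is regular exponential of order 5
and has an exponential representation with

$$t = \Bigl(\sum x_{i1}^2,\ \sum x_{i1},\ \sum x_{i1}x_{i2},\ \sum x_{i2},\ \sum x_{i2}^2\Bigr)$$

as a minimal canonical statistic. The corresponding minimal canonical parameter
is given by the following equations

$$\begin{aligned}
\theta_1 &= \frac{-1}{2(1-\rho^2)\sigma_1^2}, \\
\theta_2 &= \frac{\xi_1}{(1-\rho^2)\sigma_1^2} - \frac{\rho\,\xi_2}{(1-\rho^2)\sigma_1\sigma_2}, \\
\theta_3 &= \frac{\rho}{(1-\rho^2)\sigma_1\sigma_2}, \\
\theta_4 &= \frac{\xi_2}{(1-\rho^2)\sigma_2^2} - \frac{\rho\,\xi_1}{(1-\rho^2)\sigma_1\sigma_2}, \\
\theta_5 &= \frac{-1}{2(1-\rho^2)\sigma_2^2}.
\end{aligned}$$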

The aim of the present example is to prove that there exists no cut, S-ancillary
with respect to the correlation coefficient $\rho$.

One has

$$\rho = \psi(\theta) = \frac{\theta_3}{2\sqrt{\theta_1\theta_5}}$$

and hence $\psi$ is of rank 3. If there exists a cut, S-ancillary with respect to $\rho$, its size
must therefore be 1 or 2.
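
The canonical parametrization of Example 10.9 can be verified numerically. The sketch below is illustrative only: the function names and the parameter values are assumptions, and the five components are taken in the order of the statistic $(\sum x_{i1}^2, \sum x_{i1}, \sum x_{i1}x_{i2}, \sum x_{i2}, \sum x_{i2}^2)$. It checks, for a single observation, that differences of the bivariate normal log-density agree with $\theta \cdot (t(a) - t(b))$, and that the correlation coefficient is recovered as $\theta_3 / (2\sqrt{\theta_1\theta_5})$.

```python
import math


def canonical_parameter(xi1, xi2, sigma1, sigma2, rho):
    """Canonical parameter (theta_1, ..., theta_5) of the bivariate normal family."""
    q = 1.0 - rho**2
    return (
        -1.0 / (2.0 * q * sigma1**2),
        xi1 / (q * sigma1**2) - rho * xi2 / (q * sigma1 * sigma2),
        rho / (q * sigma1 * sigma2),
        xi2 / (q * sigma2**2) - rho * xi1 / (q * sigma1 * sigma2),
        -1.0 / (2.0 * q * sigma2**2),
    )


def canonical_statistic(x1, x2):
    """Canonical statistic of one observation (x1, x2)."""
    return (x1 * x1, x1, x1 * x2, x2, x2 * x2)


def log_density(x1, x2, xi1, xi2, sigma1, sigma2, rho):
    """Log-density of the bivariate normal distribution at (x1, x2)."""
    q = 1.0 - rho**2
    u = (x1 - xi1) / sigma1
    v = (x2 - xi2) / sigma2
    norm = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(q)
    return -math.log(norm) - (u * u - 2.0 * rho * u * v + v * v) / (2.0 * q)


def rho_from_theta(theta):
    """Correlation coefficient as a function of theta_1, theta_3 and theta_5."""
    return theta[2] / (2.0 * math.sqrt(theta[0] * theta[4]))


params = (1.0, -2.0, 1.5, 0.5, -0.3)  # (xi1, xi2, sigma1, sigma2, rho), illustrative
theta = canonical_parameter(*params)
a, b = (0.7, -1.1), (2.3, 0.4)

# The normalizing constant cancels in a difference of log-densities.
lhs = log_density(*a, *params) - log_density(*b, *params)
rhs = sum(
    th * (ta - tb)
    for th, ta, tb in zip(theta, canonical_statistic(*a), canonical_statistic(*b))
)
```

Only $\theta_1$, $\theta_3$ and $\theta_5$ enter `rho_from_theta`, which is the sense in which the components $t_2$ and $t_4$ are the natural candidates for a cut in the argument that follows.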

Suppose there is such a cut, of order 2. Theorem 10.2 implies that $(t_2, t_4)$ is
(equivalent to) this cut, i.e. that the family of marginal distributions of $(t_2, t_4)$ is
exponential of order 2. But $(t_2, t_4) = (\sum x_{i1}, \sum x_{i2})$, so a contradiction has
obviously been reached.

Next, suppose there exists a cut of size 1, S-ancillary with respect to $\psi$. On
account of Corollary 10.2 it causes no loss of generality to assume that this cut is a
one-dimensional component $\tilde t^{(1)}$ of a minimal canonical statistic $\tilde t$. Let
$\tilde\theta = (\tilde\theta^{(1)}, \tilde\theta^{(2)})$ denote the corresponding minimal canonical parameter. Since $\tilde t^{(1)}$ is
S-ancillary with respect to $\psi$, the parameter $\psi$ depends on $\tilde\theta$ only through $\tilde\theta^{(2)}$.
Thus, on account of Lemma 10.1, $\tilde t^{(1)}$ is an affine transformation of $(t_2, t_4)$. This
implies that $\tilde t^{(1)}$ is normally distributed and hence the exponential family
corresponding to the marginal distribution of $\tilde t^{(1)}$ is of order 2, which contradicts
the fact that the size of $\tilde t^{(1)}$ is 1. ►


10.5 M-ANCILLARITY AND EXPONENTIAL FAMILIES

A number of examples of M-ancillarity in exponential families have been given in 
Sections 4.1 and 4.4. Further examples are presented below. Moreover, a 
necessary condition for M-ancillarity in discrete type families will be established 
and illustrated. This condition is similar to the necessary condition for S- 
ancillarity provided by Theorem 10.5. It will be demonstrated that if the necessary 
condition to be presented is not fulfilled then, as a rule, some value (or values) of 
the statistic in question gives strong evidence against certain extreme values of the 
interest parameter. Except for Example 10.10, the families considered in this 
section are all of discrete type. 

For each of the instances of M-ancillarity mentioned in the following the
required property of universality is established by first proving that the marginal
distribution of the relevant component is strongly unimodal and then
invoking one of Corollaries 9.9 or 9.10.

Corollary 10.3 and Theorem 10.2 should be kept in mind in the following. They
imply, in particular, maximality of all the M-ancillary statistics discussed.

The two final examples of the section contain instances of pointwise M- 
ancillarity. 

Example 10.10. System reliability for components in series. Suppose a certain
system consists of two components operating in series and having life lengths
which follow independent, exponential distributions with mean values $\lambda_1^{-1}$ and
$\lambda_2^{-1}$, respectively. The probability that the system fails before time t is
$1 - \exp\{-\psi t\}$ where $\psi = \lambda_1 + \lambda_2$.

Suppose, moreover, that the life lengths of $n_i$ items of the type used for the ith
component, i = 1, 2, have been recorded in order to draw inference on $\psi$. The sums
$x_1$ and $x_2$, say, of the observed lifelengths are gamma distributed with parameters
$(n_1, \lambda_1)$ and $(n_2, \lambda_2)$.

The joint distribution of $x_1$ and $x_2$ is strongly unimodal and hence on account
of Theorem 6.4 and Corollary 9.10, the difference $x_1 - x_2$ is M-ancillary with
respect to $\psi$. (It may be noted that $x_1 - x_2$ is not G-ancillary.) An explicit
expression for the conditional distribution given $x_1 - x_2$ may be found in
Lentner and Buehler (1963). ►
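
As a small numerical companion to Example 10.10, the sketch below compares two ways of computing the probability that a two-component series system fails before time t. It is illustrative only and rests on an assumption: each component is described by its failure rate (the reciprocal of its mean life length), so that the system lifetime, being the minimum of two independent exponential lifetimes, is exponential with rate equal to the sum of the two rates. The function names are hypothetical.

```python
import math


def survival(rate, t):
    """Probability that an exponential lifetime with the given failure rate exceeds t."""
    return math.exp(-rate * t)


def series_failure_probability(rate1, rate2, t):
    """System fails before t unless both independent components survive to t."""
    return 1.0 - survival(rate1, t) * survival(rate2, t)


def closed_form(rate1, rate2, t):
    """Same probability written as 1 - exp(-psi t) with psi the sum of the rates."""
    psi = rate1 + rate2
    return 1.0 - math.exp(-psi * t)


checks = [
    (series_failure_probability(r1, r2, t), closed_form(r1, r2, t))
    for r1, r2, t in [(0.5, 1.5, 0.3), (2.0, 0.1, 1.0), (0.01, 0.02, 50.0)]
]
```

Under this assumption the interest parameter is a linear function of the two rates, which is what allows the difference of the two lifetime sums to play the role of the complementary statistic.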

The kind of argument for M-ancillarity given in Example 10.10 goes to show
the following general result. If $\mathfrak{P}$ is full, linear, of continuous type, and strongly
unimodal then for any pair $t^{(1)}$ and $\theta^{(2)}$ the statistic $t^{(1)}$ is M-ancillary with respect
to $\theta^{(2)}$. (Furthermore, under minor additional smoothness assumptions,
conditions (iv) and (v) of Theorem 4.7 will be satisfied, cf. Theorem 9.27.)

Example 10.11. System reliability for components in parallel. With two
components in parallel, functioning independently and failing with probabilities $p_1$



212 Inferential Separation and Exponential Families 

and $p_2$, the probability of failure of the system is $p_1p_2$. Suppose $p_1$ and $p_2$ are small
and that the components have been tested separately in, respectively, $n_1$ and $n_2$
Bernoulli trials, where $n_1$ and $n_2$ are comparatively large. Then the problem of
drawing inference on $p_1p_2$ from the observed numbers of failures $x_1$ and $x_2$ may
be treated, approximately, as if $x_i$ followed a Poisson distribution with mean
value $\lambda_i$ (i = 1, 2) and as if the interest parameter was $\psi = \lambda_1\lambda_2$.

In the formulation with Poisson variation, $x_1 - x_2$ is M-ancillary with respect
to $\psi$, by Theorem 6.6 and Corollary 9.9. (As is well known, the distribution of
$x_1 - x_2$, and hence also the distribution of $x_1$ given $x_2 - x_1$, is expressible in
terms of the modified Bessel functions $I_\nu$.) ►
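
The Bessel-function expression mentioned at the end of Example 10.11 can be illustrated as follows. This is a sketch under stated assumptions, not the author's computation: it uses the standard formula $P\{x_1 - x_2 = k\} = e^{-(\lambda_1+\lambda_2)}(\lambda_1/\lambda_2)^{k/2} I_{|k|}(2\sqrt{\lambda_1\lambda_2})$ for the difference of two independent Poisson variates, evaluates $I_\nu$ by its power series, and compares the result with a direct (truncated) convolution sum. The function names, the truncation points, and the numerical values are arbitrary choices made for the illustration.

```python
import math


def bessel_i(order, x, terms=40):
    """Modified Bessel function I_order(x), integer order >= 0, via its power series."""
    return sum(
        (x / 2.0) ** (2 * m + order) / (math.factorial(m) * math.factorial(m + order))
        for m in range(terms)
    )


def poisson_pmf(x, lam):
    """Poisson point probability at x."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def difference_pmf_bessel(k, lam1, lam2):
    """P{x1 - x2 = k} expressed through a modified Bessel function."""
    return (
        math.exp(-(lam1 + lam2))
        * (lam1 / lam2) ** (k / 2.0)
        * bessel_i(abs(k), 2.0 * math.sqrt(lam1 * lam2))
    )


def difference_pmf_direct(k, lam1, lam2, cutoff=80):
    """P{x1 - x2 = k} by summing over the value j of x2 (truncated at cutoff)."""
    return sum(
        poisson_pmf(j + k, lam1) * poisson_pmf(j, lam2)
        for j in range(max(0, -k), cutoff)
    )


lam1, lam2 = 0.3, 0.5  # small means, as in the rare-failure approximation
largest_gap = max(
    abs(difference_pmf_bessel(k, lam1, lam2) - difference_pmf_direct(k, lam1, lam2))
    for k in range(-5, 6)
)
```

The two evaluations agree to rounding error for the values tried, so either may be used when conditioning on the observed difference.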

Example 10.12. 2 × 2 contingency tables. If $x_{1\cdot}$ is given then $x_{\cdot 1}$ is M-ancillary with
respect to the interaction parameter. In fact, the distribution of $x_{\cdot 1}$ is strongly
unimodal because it is the convolution of the distributions of $x_{11}$ and $x_{21}$ which
are binomial and hence strongly unimodal; now apply Corollary 9.9.

Suppose next that only the total $x_{\cdot\cdot}$ is given. The distribution of $(x_{1\cdot}, x_{\cdot 1})$ is then
the $x_{\cdot\cdot}$-fold convolution of a distribution on $\{0, 1\}^2$ and this implies that it is
strongly unimodal, see Pedersen (1975a). Consequently, $(x_{1\cdot}, x_{\cdot 1})$ is M-ancillary
with respect to the interaction parameter. ►
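
The first step of Example 10.12 can be checked on a small case. The sketch below is an illustration under an assumption: for a distribution on the integers, strong unimodality is taken in its usual one-dimensional form, namely log-concavity of the point probabilities ($p_k^2 \ge p_{k-1}p_{k+1}$ with no internal zeros). The helper names and the binomial parameters are hypothetical.

```python
from math import comb


def binomial_pmf(n, p):
    """Point probabilities of a binomial(n, p) distribution on 0, ..., n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def convolve(f, g):
    """Point probabilities of the sum of two independent lattice variates."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


def is_log_concave(pmf, rel_tol=1e-12):
    """True if all probabilities are positive and p_k^2 >= p_{k-1} p_{k+1}."""
    if any(p <= 0.0 for p in pmf):
        return False
    return all(
        pmf[k] ** 2 >= pmf[k - 1] * pmf[k + 1] * (1.0 - rel_tol)
        for k in range(1, len(pmf) - 1)
    )


# Column total as the sum of two independent binomial cell counts.
column_total = convolve(binomial_pmf(3, 0.2), binomial_pmf(5, 0.7))
bimodal = [0.4, 0.1, 0.5]  # a distribution that is not log-concave, for contrast
```

The check passes for the convolution of the two binomials and fails for the bimodal contrast case, as expected from the closure of log-concave lattice distributions under convolution.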

Example 10.13. 2 × 2 × 2 contingency tables. Pedersen (1975b) established a
number of cases of M-ancillarity in 2 × 2 × 2 tables. His results, some of
which were obtained by quite intricate proofs (of strong unimodality), are
summarized in Table 10.1, in which the second column indicates which
interaction parameters are assumed to be zero as part of the model specification.
There are, if $x_{\cdot\cdot\cdot}$ is given, one second order interaction

$$\theta_{123} = \ln \frac{p_{111}\,p_{221}\,p_{122}\,p_{212}}{p_{121}\,p_{211}\,p_{112}\,p_{222}}$$


Table 10.1. Interest parameters and corresponding M-ancillary statistics, for various 
contingency table models 


Interest 

parameter 


Submodel 


X X x^^^ 

given given given 


M-ancillary statistic 


0123 


(Xii_.?Ci i,Xi ) 


O 12 

0123 — 0 

{Xi i,x,. )■■ 


(0123? 012) 




(012? 013? 023) 

0123 = 0 

• — 

(^1 n X 1 ) 

012 

0123 ~ 013 = 023 = 0 

— 

v',’) 
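and three first order interactions

$$\theta_{12} = \ln\frac{p_{112}\,p_{222}}{p_{122}\,p_{212}}, \qquad \theta_{13} = \ln\frac{p_{121}\,p_{222}}{p_{122}\,p_{221}}, \qquad \theta_{23} = \ln\frac{p_{211}\,p_{222}}{p_{212}\,p_{221}}.$$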

Table 10.1 gives the minimal canonical statistic corresponding to fixed values 
of the interest parameter, provided this statistic has been proved to be M- 
ancillary. An entry is empty if the question of M-ancillarity is, so far, undecided, 
and it contains a bar ( — ) if the question has no meaning in the model concerned. 

It was furthermore observed by Pedersen (1975b) that for the table with given
total $x_{\cdot\cdot\cdot}$ the set of marginals $(x_{1\cdot\cdot}, x_{\cdot 1\cdot}, x_{\cdot\cdot 1})$ is not M-ancillary with respect to
$(\theta_{12}, \theta_{13}, \theta_{23}, \theta_{123})$, see Example 10.16 below. ►

Example 10.14. Logistic dose-(binomial) response model (cf. Examples 9.14 and
10.7). The statistic $s = \sum u_i$, being a sum of independent binomial variates, has a
strongly unimodal distribution and is hence M-ancillary with respect to $\beta$
(whatever the number and placing of the doses and whatever the number of
animals per dose).

With three doses placed at −1, 0, 1, the other component w equals $u_3 - u_1$. It
follows (using again Theorem 6.6 and Corollary 9.9) that w is M-ancillary with
respect to $\alpha$. However, this is an exceptional case; in general w is not M-ancillary,
cf. Example 10.18. ►

The following theorem and its corollary, together, are to a large extent 
analogous to Theorem 10.5. 

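Theorem 10.10. Suppose $\mathfrak{P}$ is of discrete type and that $\Theta$ is of the form

(1)    $\Theta = \Theta^{(1)} \times R^{k_2}$,

$k_2$ denoting the dimension of $\theta^{(2)}$.

Let $t^{(1)}$ be one of the possible values of $T^{(1)}$ and let $S_{t^{(1)}}$ be the set of points in S having
first component equal to $t^{(1)}$.

If there exists a point in the convex hull of S which has $t^{(1)}$ as first component and
which does not belong to the closure of the convex hull of $S_{t^{(1)}}$ then there exists a
convex cone K in $R^{k_2}$ such that for $\theta^{(2)} \in K$ and $|\theta^{(2)}| \to \infty$

$P_\theta\{T^{(1)} = t^{(1)}\} \to 0$ uniformly in $\theta^{(1)}$.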

Proof. $\mathfrak{P}$ has point probabilities

$P_\theta\{T = t\} = a(\theta)b(t)e^{\theta\cdot t}.$

Set $C^{(1)} = \mathrm{cl\,conv}\,S_{t^{(1)}}$ and suppose $t_0$ is a point such that $t_0^{(1)} = t^{(1)}$ and
$t_0 \in \mathrm{conv}\,S \setminus C^{(1)}$. On account of Carathéodory's convexity theorem there exists an
integer m with $1 \le m \le k + 1$, points $t_1, \ldots, t_m \in S$ and positive numbers
$\lambda_1, \ldots, \lambda_m$ such that $\lambda_1 + \cdots + \lambda_m = 1$ and

(2)    $t_0 = \lambda_1 t_1 + \cdots + \lambda_m t_m.$

Set 

Mid) = / = 1, . . m| 

and 


One has 


Now, 






M(0) > n 

i = 1 




= exp 






Equation (2) says, in particular, that 


and consequently 


It is obviously that 


I 

i^l 


M(0) > n 

i = l 






0(2) .j(2) 


and hence, using (2) again, 


i = 1, ...,m 


It follows that 


m 


M(0) ^ n bitif' 

i=l 


w^y 


M(e) > Cop(0<^>) 
where


and 


m 


n 

t = 1 





sup 

s<2)eS,(‘> 




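Define K by

$K = \{\theta^{(2)} : \inf_{s^{(2)} \in S_{t^{(1)}}} \theta^{(2)} \cdot (t_0^{(2)} - s^{(2)}) > 0\}.$

The fact that $t_0 \notin C^{(1)}$ implies that K is non-empty, and obviously

$\lambda K \subset K$ for $\lambda > 0$

and

$\lambda K + (1 - \lambda)K \subset K$ for $\lambda \in (0, 1)$,

so that K is a convex cone.

Now, if $\theta^{(2)} \in K$ and $|\theta^{(2)}| \to \infty$ then $M(\theta) \to \infty$ and this implies that
$P_\theta\{T^{(1)} = t^{(1)}\} \to 0$. The convergence is uniform in $\theta^{(1)}$ as $\rho$ does not depend on
$\theta^{(1)}$. ►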

In cases where it is of interest to find K one may use that K is an intersection of
halfspaces, namely

$K = \bigcap_{s^{(2)} \in S_{t^{(1)}}} \{\theta^{(2)} : \theta^{(2)} \cdot (t_0^{(2)} - s^{(2)}) > 0\}.$

Corollary 10.5. Suppose S is finite and $\mathfrak{P}$ is full.

If for one of the possible values $t^{(1)}$ of $T^{(1)}$ there exists a point in the convex hull of
S which has first component $t^{(1)}$ and which does not belong to the closed convex hull
of $S_{t^{(1)}}$ then $T^{(1)}$ is not M-ancillary with respect to $\theta^{(2)}$.

Proof. For $\theta^{(2)} \in K$ and $|\theta^{(2)}|$ large, $t^{(1)}$ is not a mode point for the family of
distributions of $T^{(1)}$. ►

In fact, for S finite, Theorem 10.10 implies not only that $t^{(1)}$ is not a mode point
for distributions of $T^{(1)}$ with $\theta^{(2)} \in K$ and $|\theta^{(2)}|$ large, but that the event
$\{T^{(1)} = t^{(1)}\}$, considered separately, affords strong evidence against such values of
$\theta^{(2)}$ (both from the likelihood viewpoint and the plausibility viewpoint).

The illustrative remark concerning Theorem 10.5, made right after the proof of
that result, applies mutatis mutandis to Corollary 10.5. In particular, suppose that
$\mathfrak{P}$ is linear and regular, that $\mathfrak{X}$ is a finite subset of $N_0^2$, and that the points (0, 0),
(0, 1), (1, 0), (1, 1) all have positive probability. Then the only statistics which can
possibly be M-ancillary, with respect to the complementary parameter function
which induces the same partition of $\mathfrak{P}$ as do the conditional distributions
given the statistic, are $x_1$, $x_2$, $x_1 + x_2$, and $x_1 - x_2$. (Unlike the case in respect of
S-ancillarity, the latter statistic $x_1 - x_2$ may, in fact, be M-ancillary.)

Example 10.15. For the 2 × 2 contingency table with one margin fixed, shown on
p. 36, the only statistics which are M-ancillary, with respect to their
complementary parameter function, are $x_1$, $x_2$, $x_1 + x_2$, and $x_1 - x_2$ (the parameter
function corresponding to $x_1 - x_2$ being $\ln(\{p_1p_2\}/\{q_1q_2\})$). ►

Example 10.16. The 2 × 2 × 2 contingency table with given total n. For $n \ge 3$, the
set of marginals $(x_{1\cdot\cdot}, x_{\cdot 1\cdot}, x_{\cdot\cdot 1})$ is not M-ancillary with respect to the set of first
and second order interactions $(\theta_{12}, \theta_{13}, \theta_{23}, \theta_{123})$ (in the notation of Example
10.13).

To show that Corollary 10.5 applies, let $T^{(1)} = (x_{1\cdot\cdot}, x_{\cdot 1\cdot}, x_{\cdot\cdot 1})$,
$T^{(2)} = (x_{111}, x_{11\cdot}, x_{1\cdot 1}, x_{\cdot 11})$ and $t^{(1)} = (1, 1, 1)$. One has

$S_{(1,1,1)} = \{(t^{(1)}, t^{(2)}) : t^{(2)} = (0, 0, 0, 0), (0, 0, 0, 1),$
$(0, 0, 1, 0), (0, 1, 0, 0), (1, 1, 1, 1)\}.$

Moreover, $(0, 0, 0, 0, 0, 0, 0) \in S$, $(2, 2, 2, 0, 1, 1, 1) \in S$ and hence

$(1, 1, 1, 0, \tfrac12, \tfrac12, \tfrac12) \in \mathrm{conv}\,S.$

But this latter point does not belong to cl conv $S_{(1,1,1)}$ and hence the suppositions
in Corollary 10.5 are fulfilled.

The above result was given by Pedersen (1975b). ►
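
The convex-hull claim in Example 10.16 lends itself to a brute-force check for n = 3. The sketch below is illustrative and makes explicit assumptions: the canonical statistic is ordered as the three one-way margins $(x_{1\cdot\cdot}, x_{\cdot 1\cdot}, x_{\cdot\cdot 1})$ followed by $(x_{111}, x_{11\cdot}, x_{1\cdot 1}, x_{\cdot 11})$, and the separating linear function `separator` is a choice made here for the illustration, not something taken from the text. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

CELLS = list(product((1, 2), repeat=3))  # cell labels (i, j, k)


def tables(n):
    """All 2x2x2 tables (cell -> count) with total n."""
    for counts in product(range(n + 1), repeat=len(CELLS)):
        if sum(counts) == n:
            yield dict(zip(CELLS, counts))


def margin(table, i=None, j=None, k=None):
    """Sum of the counts over all cells matching the indices that are specified."""
    wanted = (i, j, k)
    return sum(
        count
        for cell, count in table.items()
        if all(w is None or w == c for w, c in zip(wanted, cell))
    )


def statistic(table):
    """Assumed ordering: three one-way margins, then x111 and three two-way margins."""
    t1 = (margin(table, i=1), margin(table, j=1), margin(table, k=1))
    t2 = (
        table[(1, 1, 1)],
        margin(table, i=1, j=1),
        margin(table, i=1, k=1),
        margin(table, j=1, k=1),
    )
    return t1 + t2


def separator(t2):
    """Linear function that is at most 1 on the fibre but larger at the midpoint."""
    a, b, c, d = t2
    return b + c + d - 2 * a


n = 3
S = {statistic(x) for x in tables(n)}
fiber = sorted(t[3:] for t in S if t[:3] == (1, 1, 1))
midpoint = tuple(
    Fraction(u + v, 2) for u, v in zip((0,) * 7, (2, 2, 2, 0, 1, 1, 1))
)
```

Because `separator` is linear, its maximum over the closed convex hull of the fibre equals its maximum over the five fibre points, so a strictly larger value at the midpoint shows that the midpoint lies outside that hull.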

It was, furthermore, pointed out by Pedersen (1975b) that the result in
Example 10.16 together with Theorem 10.10 simply implies that for every $n \ge 3$
there exists a strongly unimodal distribution on $\{0, 1\}^3$ such that its n-fold
convolution is not strongly unimodal.

Example 10.17. Genotype distribution and selection. n individuals from a diploid
population are classified according to genotype at a single, diallelic locus. Denote
the genes by A and a, and let $x_1$, $x_2$, $x_3$ be the number of individuals of genotypes
AA, Aa, aa, respectively. Supposing that the n individuals form a random sample
from an infinite population in which the frequencies of the three genotypes are
$p_1$, $p_2$, and $p_3$, the distribution of $(x_1, x_2, x_3)$ is the trinomial

$$\frac{n!}{x_1!\,x_2!\,x_3!}\; p_1^{x_1} p_2^{x_2} p_3^{x_3}.$$

This expression may be written in the exponential form

$$a(\theta)\, b(z, x_2)\, e^{\theta_1 z + \theta_2 x_2}$$

where $z = 2x_1 + x_2$ is the number of A genes in the sample and where

$$\theta_1 = \tfrac12 \ln\frac{p_1}{p_3}, \qquad \theta_2 = \tfrac12 \ln\frac{p_2^2}{4p_1p_3}.$$

As discussed in Example 8.7, $\theta_2$ is a meaningful indicator of deviation from
Hardy-Weinberg distribution, especially when such deviation is caused by
zygotic selection. Let $x = (z, x_2)$, then

$\mathfrak{X} = \{(z, x_2) : z = 0, 1, \ldots, 2n;\ x_2 = 0, 2, \ldots, \min\{z, 2n - z\}$ for z even,
$x_2 = 1, 3, \ldots, \min\{z, 2n - z\}$ for z odd$\}$,

see Figure 10.3.

Figure 10.3 The set $\mathfrak{X}$ of values of $(z, x_2)$ and the convex hull C of $\mathfrak{X}$ for n = 4.

Each of the statistics $x_1$, $x_2$, and $x_3$ is M-ancillary (and, incidentally, also
S-ancillary), with respect to $\ln(p_2/p_3)$, etc. It is evident from Figure 10.3 (and
Corollary 10.5) that there is no other statistic which is M-ancillary with respect to
its complementary parameter function.
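
What Figure 10.3 shows for the statistic z can also be computed. The following sketch is an illustration, with hypothetical function names: it lists the sample space of $(z, x_2)$ for n = 4 and, for each value of z, compares the range of $x_2$ attained in the sample space with the vertical section of the convex hull at that z. In two dimensions the section of the hull is spanned by segments joining pairs of points, so scanning all pairs suffices.

```python
from fractions import Fraction


def sample_space(n):
    """All values of (z, x2) with z = 2*x1 + x2 and x1 + x2 <= n."""
    return sorted(
        {(2 * x1 + x2, x2) for x1 in range(n + 1) for x2 in range(n + 1 - x1)}
    )


def fiber_range(points, z):
    """Smallest and largest x2 among sample points with the given z."""
    values = [x2 for (zz, x2) in points if zz == z]
    return min(values), max(values)


def hull_section(points, z):
    """Smallest and largest x2 of the convex hull on the vertical line at z."""
    lo = hi = None
    for za, ya in points:
        for zb, yb in points:
            if not (za <= z <= zb):
                continue
            if za == zb:
                candidates = (Fraction(ya), Fraction(yb))
            else:
                w = Fraction(z - za, zb - za)
                candidates = (ya + w * (yb - ya),)
            for y in candidates:
                lo = y if lo is None else min(lo, y)
                hi = y if hi is None else max(hi, y)
    return lo, hi


n = 4
space = sample_space(n)
# Values of z for which the hull section is strictly larger than the fibre's hull.
flagged = [
    z for z in range(2 * n + 1) if hull_section(space, z) != fiber_range(space, z)
]
```

For n = 4 the flagged values are exactly the odd values of z: the hull reaches down to $x_2 = 0$ there, whereas an odd z forces $x_2 \ge 1$. For even z the two ranges coincide, so the supposition of Corollary 10.5 is met only at odd z.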

In particular, z is not M-ancillary with respect to $\theta_2$, and if an odd value of z is
observed this, by itself, is strong evidence against negative, numerically large,
values of $\theta_2$. Thus, if $\theta_2$ is the interest parameter and if one were to draw the
inference solely in the conditional distribution of $x_2$ given z (which, of course,
depends on $\theta_2$ only) then one would, provided z was odd, be ignoring available
information on $\theta_2$. It seems, however, a reasonable conjecture that any even value
of z is pointwise M-nonformative with respect to $\theta_2$.
Example 10.19. Let k = 2 and let S consist of the origin and all points of the form
$(\varepsilon^{(1)}/n, \varepsilon^{(2)}n)$, where $\varepsilon^{(1)} = \pm 1$, $\varepsilon^{(2)} = \pm 1$ and n = 1, 2, .... Take $\mathfrak{P}$ to be the family

$\{P_\theta : \theta \in \Theta\}$ with

Pe{0} = a(9)^ 

Pelt} = «(0)^e»-' for t = 

f(0) = a(0)“^ = - + - Z Z 

2 8 g(i)^£(2) 

and 


$\Theta = R \times (-\ln 2, \ln 2).$


Clearly, $\mathfrak{P}$ is regular.

Here, for $t^{(1)} = 0$, any point $(0, t^{(2)})$ with $t^{(2)} \neq 0$ belongs to conv S but not to cl
conv $S_0$ = {0}. However


P,{T<i) = £a)/n} 
Pe{T^^^^0} 






and thus for every $\theta^{(2)} \in (-\ln 2, \ln 2)$, $t^{(1)} = 0$ is a mode point for


10.7 NOTES 

When specialized to S-ancillarity the results in Section 10.1 coincide with
material presented in Barndorff-Nielsen (1973a) and Barndorff-Nielsen and
Blaesild (1975). The contents of Sections 10.2-10.4 stem from the same two
papers and from Barndorff-Nielsen (1976a). However, Theorems 10.6-10.9 are
but a paraphrasing of results due to Joshi and Patil (1970, 1971). Theorem 10.10
and Corollary 10.5 constitute an extension of the main conclusion in
Barndorff-Nielsen and Kvist (1974).

As mentioned at the end of Section 4.1, the maximum likelihood estimator of 
an interest parameter $\psi$ will in general not even be consistent if the number of
incidental parameters tends to infinity together with the sample size n, but in 
important cases, primarily under exponential models, the incidental parameters 
may be eliminated by a suitable conditioning and the conditional maximum 
likelihood estimator is both consistent and asymptotically normal. Subject to 
fairly mild regularity conditions it is possible to show that the maximum
plausibility estimate of $\psi$, whether conditional or unconditional (often these are
the same, cf. Section 4.6), differs from the conditional maximum likelihood
estimate by a quantity which is of the order of only $n^{-1}$, cf. Höglund (1974),
and also Section 9.9.
Lods functions 1-2, 7, 9, 13, 19-23, 30,
109, 181
b- 13,21-23 
equivalence of 21-22 
f- 13, 21-23 
linear 23 
normed 21 

Logarithmic distribution 127 
multivariate 118 
Log-concavity, of a function 93 
of a probability measure 95 
Log-convexity 100-101 
Logistic dose-(binomial) response model 
156-158, 161, 205, 213, 218 


Marginal family of an exponential family 
127-129 

Markov kernel 5, 38-39, 45-46 
Markov process 29-30 
Martin-Lof 177 
Maximal invariants 52 
Maximum b-lods estimation 22-23 
Maximum entropy 137 
Maximum likelihood 8 
Maximum likelihood estimation 13, 15, 
22, 27, 30-31, 37, 62, 189 
conditional 37-38, 58-59, 219 
in exponential families 137, 150-164, 
175-177, 183-188, 190 
of sub-parameters 62, 156, 158 
Maximum likelihood prediction 24, 172 
Maximum plausibility estimation 13, 15, 
17, 22, 58-62, 189, 219 
conditional 58-62, 219 
in exponential families 168-170, 188, 
190 

Maximum plausibility prediction 24, 
171-173 


Mean value, mapping 121, 188 
parametrization 121 
Minimal sufficient σ-algebra 69
Minimum discrimination information 
137 

von Mises-Fisher distribution 53, 113- 
114, 130 

Mixed parametrization 121-122, 148- 
149, 183 

Mixture experiment 34-35, 69 
Mode, mapping 17, 168, 188 
point 11,16, 24-25, 48-49, 60-61 
point of family of conditional probability 
functions 12 

point of family of probability functions 
11, 13, 15, 59 

point of probability function 11
size, constant 11-13 
Model control 62, 133 
Model function 7, 23 
Multinomial distribution 26-27, 61, 100, 
107, 118, 132, 207 
Multiple recapture census 135 


Negative binomial distribution 15, 28, 
56, 100, 128, 185, 202 
Negative multinomial distribution 28, 
100, 107, 118, 207 
Neyman-Pearson 68 
Nonformation 1-2, 33-35, 37-38, 46-48, 
50, 56, 65, 70 
B- 35,47-48 
G- 35, 47, 49, 52, 55 
M-B- 65 
M- 35, 47-49, 217 
pointwise 46 
pointwise B- 47-48, 66 
pointwise M- 47-49, 218 
pointwise S- 47-49 
principle of 34-35 
S- 35-36, 47-49 
Normal cone 73-74 
Normal cone mapping 74 
Normal distribution 8, 52, 62-63, 97,
119, 130, 132, 193

multivariate 7, 28, 37, 97, 104 107, 
116, 119, 122, 126, 146, 151- 154, 
209 


Normal vector to a convex set 73 



Ods function 1,22,62 
b- 22-23 
f- 22-23 
normed 22 
Orbit 4,52 

Order of an exponential family, see Ex- 
ponential family 
Orderings, b- 20 
f- 20 

Open exponential family, see Exponential 
family 

Open kernel 117-118

Orthogonal parameters 30-31, 182-185 

Parameter function 6 
Parametrization 6, (see also Canonical; 

Mean value; Mixed) 

Pareto distribution 152
Partially observed exponential situation 
176-177 
Perfect fit 48 

Plausibility 1-2, 6, 11, 17, 21, 48, 215, 218,
(see also Maximum plausibility estimation;
Plausibility function; Prediction)

ratio test, see Test 

Plausibility function 1, 7, 11-16, 15, 19, 
21-22, 58-60, 62, 169, (see also 
Plausibility) 
conditional 58-60 
log- 9, 20, 21-23, 140 
log-plausibility function in exponential 
families 143-144, 168-170 
normed 13-14, 16
sup-log- 140 

Poisson distribution 27-28, 57, 100-101,
107, 115, 118, 132-134, 142, 149, 180,
195-196, 202, 207, 212
unimodality of mixtures of 101 
Poisson process 48 
with log-linear trend 52 
Poisson regression, see Regression an- 
alysis 

Polar of a convex cone 74 
Polyhedral concave function, see Concave 
functions 

Polyhedral convex function, see Convex 
function 

Polyhedral convex set, see Convex sets 
Polytope 76 

Power series family 118, 205, (see also 
Sum-symmetric) 


Subject Index 237 

Precision in (conditional) inference 34-36
of a (multivariate) normal distribution 7
of a von Mises-Fisher distribution 113
Prediction 11, (see also Maximum likelihood
prediction; Maximum plausibility
prediction; Prediction function)
Prediction function 1, 20, 23, 31, (see also
Prediction)

likelihood prediction 24-25, 31 
likelihood prediction, for exponential 
families 170-172 
plausibility prediction 24-25, 31 
plausibility prediction, for exponential 
families 171-173 

Probability 20, (see also Probability func- 
tion) 

Probability function 11-13, 19, 21-22, 25,
30, (see also Probability) 
log- 9,20,23,139 

log-probability functions in exponential 
families 164-168, 177-182 
Probability measure, of c-discrete type 6 
of continuous type 6 
of discrete type 6 
Product exponential family 127 

Quasi-ancillarity, see Ancillarity 
Quasi-convex function 77, 96, 98 
Quasi-sufficiency, see Sufficiency 

Rank of a mapping 191 
Realizable, sample points 6 
values of a statistic 6 
Recession, cone 74, 82 
function 79, 83 
Recombination 44, 54-55 
Reference set 70 
Regression analysis, linear 36 
of lifetime data 53 
Poisson 51 

Regular exponential family 116-117, 126,
^114, 149, 153, 200, 203-204, 215 
example of non- 117 
Relative boundary of convex set, see 
Convex sets 

Relative interior of convex set, see Convex 
sets 

Relatively maximal ancillary statistic, see
Ancillary statistic
Resultant density for a von Mises-Fisher
distribution 128-129 
Resultant length density for a von 
Mises-Fisher distribution 129 
Rockafellar 89

Sample aspect 19 
Schur concavity 101 
Separable σ-algebra 40
Separate inference 2, 9, 33-34, 36, 68 
Separating hyperplane 73 
properly 73 
strongly 73 

Separation, inferential 2, 7, 9, 11, 33, 
37-38, 46, 56, 69 

Singular distribution, see Distributions 
Singular multinomial distribution 206 
Size of a quasi-ancillary statistic 192 
Skewness 119, 179 
Skitovic-Darmois theorem 67 
Stable distributions 104, 117, 142, 184 
Standard representation for an expo- 
nential family 115 
minimal 115 
Statistical field 5 

Statistical information, see Information 
Statistical mechanics 8, 137 
Steep convex function, see Convex func- 
tion 

Steep exponential family, see Exponential 
family 

Stopping rule 14,65 
Subdifferential of a convex function 84 
Subgradient of a convex function 84 
Sufficiency 1-2, 8, 9, 33, 48-50, 52, 57,
64-65, 68-70, 109, 133, 136-137, 152,
191, (see also Sufficient statistic)
axiom 65 

B- 35, 37-38, 43, 45, 49, 65-66, 69 
G- 37, 52-55, 67, 70 
M- 37, 52, 55, 70 
principle of 35 
quasi- 57-58, 70 
S- 48-51,56,66,70 
Sufficient statistic 9, 33, 35, 38, 69, 191,
(see also Sufficiency)
minimal 12, 15, 22, 56-58, 69, 111, 126
minimal B- 38, 40-45, 58 
Sum-symmetric power series family 118, 
205-208 


Support function of a convex set 83 
Support of a probability measure 90 
Supremum of collection of convex func- 
tions 80 

System reliability 211-212 

Test, conditional 68 
likelihood ratio 140,190 
plausibility ratio 140 
similar 68-69 
Transformations 4-6, 30-31 
group of 4-6, 52-54 
normalizing 30, 177-182 
spread-stabilizing 30, 177-182 
transitive class of 4-5, 11, 13, 47 
unitary class of 4-5, 52 
variance-stabilizing 30, 179 
Transitive class of transformations, see 
Transformations 
Trinomial distribution 99, 195 
Truncated family of exponential family 
130-131 

Unimodality 2, 11, 49, 71, (see also
Exponential family)
of continuous type distributions 96-98, 
101 

of discrete type distributions 98-100 
strong 49, 107, 144, 211-212, 216, 218 
strong unimodality of continuous type 
distributions 96-98, 101 
strong unimodality of discrete type 
distributions 98-100 
Uniqueness in statistical inference 56-57, 
65 

non- 2,37,65-68 

Unitary class of transformations, see 
Transformations 

Universality 11-13, 25, 47, 49, 60, 63, 169,
172, 211, (see also Exponential family)
strict 11, 60 


Variation independence, see Independence 


Wedderburn 179 

Wishart distribution 98, 107, 128, 

149-150 



